ALESA NOVA · Agent-Safety & Assurance

How we test AI agents — and what ALESA NOVA defends.

We do not claim "bullet-proof." We give you mechanical, human-in-the-loop controls and a reproducible test suite you can run yourself. Below: the recognised agent-safety toolset, what each attack class is, the ALESA mechanism that defends it, and how to verify it independently.

Summary: ALESA NOVA does not claim to be "kebal" (invulnerable). It is a set of mechanical security controls plus human approval, backed by 52 tests (45 enforcement + 7 adversarial red-team) that you can run yourself. This page explains the tools used to test AI agents, what each one attacks, and the ALESA mechanism that counters it. Aligned with BNM RMiT · NACSA · MAMPU · PDPA.

01 / Reproducible proof

52 reproducible tests — including an adversarial red-team. Run them yourself.

The strongest assurance is not our word — it is a test you can re-run. The enforcement layer ships with a reproducible suite. Extract the package, install, and watch every gate prove itself.

$ npm ci                    # install (no cloud dependency)
$ npm test                  # → 52 passed, 0 failed (45 enforcement + 7 adversarial red-team)
$ npm run build             # WebAdmin builds clean
$ npm run ops:egress-check  # egress ruleset = valid (default-deny)
Independently verifiable: a CISO or auditor extracts the package, runs npm test, and observes all 52 tests pass (45 enforcement + 7 adversarial red-team). The package carries a SHA-256 checksum so its integrity can be checked before anything is run. Don't trust the claim; reproduce it.
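As an illustration of the integrity step, here is a minimal sketch of comparing a package's SHA-256 digest with a published checksum. The helper name matchesSha256 is hypothetical and not part of the ALESA NOVA package; only Node's built-in crypto module is used.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (not part of the ALESA NOVA package): returns true when
// the SHA-256 digest of `data` equals the published hex checksum.
function matchesSha256(data: Uint8Array | string, publishedHex: string): boolean {
  const actualHex = createHash("sha256").update(data).digest("hex");
  // Published checksums are sometimes upper-case or carry trailing whitespace.
  return actualHex === publishedHex.trim().toLowerCase();
}
```

In practice the archive bytes would be read from disk (for example with readFileSync) and passed in as `data`. A mismatch means the package should not be installed.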
02 / The agent-safety toolset

The recognised tools for testing AI agents — and what each attacks.

These are the industry / research-recognised tools used to probe AI agents. We explain what each one attacks so you understand the threat model ALESA NOVA is built against.

Tool | Who | What it attacks / probes
AgentDojo | ETH Zürich (SPY Lab) | Prompt-injection against tool-using agents: can the agent be hijacked by malicious data or tool output into performing unintended actions?
Garak | NVIDIA | LLM vulnerability scanner: probes for data leakage, exfiltration, jailbreak, and prompt injection.
AgentHarm | UK AI Safety Institute / Gray Swan | Measures whether an agent can be persuaded to carry out harmful tasks (refusal vs compliance).
τ-bench | Sierra | Agent reliability & consistency in real tool-use: does it do the same correct thing, repeatably?
Inspect | UK AI Safety Institute | An evaluation framework for authoring and running custom agent evaluations, including deception / scheming tests.

03 / What ALESA NOVA prevents

Every attack class → the ALESA mechanism → the reproducible test.

This is the heart of it: for each attack the toolset probes, the specific ALESA control that defends it, and the test in our suite that proves it.

Attack class (probed by) | ALESA mechanism that defends it | Reproducible test
Prompt-injection hijack (AgentDojo) | Policy-Enforcement-Point with server-derived identity (a caller can't forge its role) + tool-allowlist (deny-by-default) + human gate on consequential actions | E2 + E10 (10 tests)
Data leakage / exfiltration (Garak) | Secret-leak gate + outbound exfil block + data-classification floor (PII can't be mislabelled) + local-only residency | E1 + E9 (10 tests)
Persuaded to harmful action (AgentHarm) | Operator governance + hard-block on sensitive actions (payment/auth/destructive) + RBAC + mandatory human approval | E4 + hard-block (6 tests)
Unreliable / inconsistent (τ-bench) | Deterministic mechanical gates + tamper-evident audit + verify-before-done | E3 (6 tests)
Tamper with the record / cover tracks | Audit external anchor: a tamper + re-hash of the ledger is detected, not silently accepted | E3 (audit-anchor)
Forged identity / privilege escalation | Identity is server-derived; any caller-supplied role/actor is ignored | E2 (PEP)
Destructive action (DB wipe, rm -rf) | Risk-floor (destructive never trivially allowed) + no-edit-without-backup + multi-layer block | E2 + E13 (egress)
Any model (Claude / Codex / DeepSeek / Qwen / local) | Model-agnostic chokepoint: the gate takes no model parameter; every model passes the identical enforcement | gateway-binding (model-agnostic test)

04 / Two guarantees that are rare

What makes ALESA NOVA different — verified in code.

Model-agnostic

Every model obeys the same gate

Enforcement lives outside the model. Whether the agent is Claude, Codex, DeepSeek, Qwen, or a local model, it hits the identical mechanical gate. A foreign, compromised, or injected model is constrained the same way — because we do not rely on the model being well-behaved.
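A minimal sketch of what "no model parameter" can look like in code. All names here (ProposedAction, ALLOWLIST, gate, decideFor) are hypothetical illustrations, not the ALESA NOVA source.

```typescript
type Decision = "allow" | "deny" | "needs_human_approval";

interface ProposedAction {
  tool: string;
  consequential: boolean; // e.g. payment, auth change, destructive operation
}

// Hypothetical allowlist, for illustration only.
const ALLOWLIST: ReadonlySet<string> = new Set(["search_docs", "send_payment"]);

// The gate's signature has no `model` parameter, so it cannot branch on which
// model proposed the action.
function gate(action: ProposedAction): Decision {
  if (!ALLOWLIST.has(action.tool)) return "deny";
  return action.consequential ? "needs_human_approval" : "allow";
}

// Every model adapter funnels its proposed action through the same gate;
// the model name is deliberately unused.
function decideFor(_modelName: string, action: ProposedAction): Decision {
  return gate(action);
}
```

Because the decision depends only on the action, swapping the model changes nothing about what is permitted.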

Anti-self-override

The agent cannot disable its own guardrails

An override requires an out-of-band, password-verified action by a human. The gate, doctrine, and settings files are edit-denied to the agent, and the agent holds no password. This is the first question every CISO asks — and it is verified in the code.
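The edit-deny rule can be sketched as a path check that the agent cannot satisfy. The directory names, the Actor type, and the passwordVerified flag are assumptions made for illustration; how the password is actually verified out-of-band is not shown in the source and is not modelled here.

```typescript
// Hypothetical principals: an agent never carries a verified-password flag.
type Actor = { kind: "agent" } | { kind: "human"; passwordVerified: boolean };

// Hypothetical protected locations standing in for gate, doctrine, and settings files.
const PROTECTED_DIRS = ["gate", "doctrine", "settings"];

// Collapse ".", "..", backslashes, and case so "src/../Settings/x" cannot slip through.
function normalizePath(path: string): string {
  const out: string[] = [];
  for (const segment of path.replace(/\\/g, "/").toLowerCase().split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") { out.pop(); continue; }
    out.push(segment);
  }
  return out.join("/");
}

function mayEdit(actor: Actor, path: string): boolean {
  const p = normalizePath(path);
  const isProtected = PROTECTED_DIRS.some((dir) => p === dir || p.startsWith(dir + "/"));
  if (!isProtected) return true;
  // Only a human whose password was verified out-of-band may touch guardrail files.
  return actor.kind === "human" && actor.passwordVerified;
}
```

The design point is that the agent variant of Actor has no field it could set to pass the check.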

05 / The enforcement spine

Eight mechanical controls, built test-first.

E1 · Residency

local-only model routing; sensitive data never leaves the premises.

E2 · Policy gate

server-derived identity, destructive risk-floor, RBAC.
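A minimal sketch of these three ideas together, under stated assumptions: the session store, role names, action names, and risk levels below are hypothetical and are not taken from the ALESA NOVA code.

```typescript
type Role = "viewer" | "operator" | "admin";
type Risk = "low" | "medium" | "high";
type Decision = "allow" | "deny" | "needs_human_approval";

interface GateRequest {
  sessionToken: string;
  action: string;
  claimedRole?: string; // caller-supplied: never read by the gate
  claimedRisk?: Risk;   // caller-supplied: may raise the risk, never lower it
}

// Hypothetical server-side session store: the only source of identity.
const SESSIONS = new Map<string, Role>([
  ["tok-viewer", "viewer"],
  ["tok-admin", "admin"],
]);

const DESTRUCTIVE_ACTIONS: ReadonlySet<string> = new Set(["db.drop", "fs.delete_tree"]);
const RISK_ORDER: readonly Risk[] = ["low", "medium", "high"];
const MAX_RISK_FOR_ROLE: Record<Role, Risk> = { viewer: "low", operator: "medium", admin: "high" };

// Risk floor: a destructive action is at least "high", whatever the caller claims.
function effectiveRisk(req: GateRequest): Risk {
  const floor: Risk = DESTRUCTIVE_ACTIONS.has(req.action) ? "high" : "low";
  const claimed: Risk = req.claimedRisk ?? "low";
  return RISK_ORDER.indexOf(claimed) > RISK_ORDER.indexOf(floor) ? claimed : floor;
}

function decide(req: GateRequest): Decision {
  const role = SESSIONS.get(req.sessionToken); // identity derived server-side
  if (role === undefined) return "deny";
  const risk = effectiveRisk(req);
  if (RISK_ORDER.indexOf(risk) > RISK_ORDER.indexOf(MAX_RISK_FOR_ROLE[role])) return "deny";
  return risk === "high" ? "needs_human_approval" : "allow";
}
```

For example, a viewer session that sends claimedRole "admin" and claimedRisk "low" with action "db.drop" is still denied, because neither claim is trusted.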

E3 · Audit anchor

tamper + re-hash of the ledger is detected.
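To see why an external anchor matters, consider a hash-chained ledger. An attacker who edits an entry and recomputes every later hash produces a chain that is internally consistent, so only a head hash stored somewhere the attacker cannot reach exposes the change. The sketch below assumes a simple SHA-256 chain; the ledger format and function names are hypothetical, not the ALESA NOVA implementation.

```typescript
import { createHash } from "node:crypto";

interface LedgerEntry { event: string; prevHash: string; hash: string; }

const GENESIS = "GENESIS";
const sha256Hex = (text: string): string => createHash("sha256").update(text).digest("hex");

// Build a chain in which each entry's hash covers the previous hash and its own event.
function buildLedger(events: readonly string[]): LedgerEntry[] {
  const ledger: LedgerEntry[] = [];
  let prevHash = GENESIS;
  for (const event of events) {
    const hash = sha256Hex(prevHash + "\n" + event);
    ledger.push({ event, prevHash, hash });
    prevHash = hash;
  }
  return ledger;
}

function headHash(ledger: readonly LedgerEntry[]): string {
  let head = GENESIS;
  for (const entry of ledger) head = entry.hash;
  return head;
}

// Internal check only: passes for any chain that was re-hashed after tampering.
function chainIsConsistent(ledger: readonly LedgerEntry[]): boolean {
  let prevHash = GENESIS;
  for (const entry of ledger) {
    if (entry.prevHash !== prevHash) return false;
    if (entry.hash !== sha256Hex(prevHash + "\n" + entry.event)) return false;
    prevHash = entry.hash;
  }
  return true;
}

// Anchored check: the head must also equal the value recorded outside the ledger.
function verifyAgainstAnchor(ledger: readonly LedgerEntry[], anchoredHead: string): boolean {
  return chainIsConsistent(ledger) && headHash(ledger) === anchoredHead;
}
```

A ledger rebuilt from altered events passes chainIsConsistent but fails verifyAgainstAnchor, which is the "tamper + re-hash" case the control targets.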

E4 · Identity

MFA, segregation of duties, session idle-timeout.

E9 · Classification

PII floors the data class; a mislabelled payload cannot be exfiltrated.
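A minimal sketch of a classification floor, assuming three data classes and two illustrative PII patterns (an email address and a 12-digit identity-card-style number). The class names and patterns are assumptions for illustration; a real detector would be far broader.

```typescript
type DataClass = "public" | "internal" | "confidential";

const CLASS_ORDER: readonly DataClass[] = ["public", "internal", "confidential"];

// Illustrative patterns only: an email address and a NNNNNN-NN-NNNN identity number.
const PII_PATTERNS: readonly RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.-]+/,
  /\b\d{6}-\d{2}-\d{4}\b/,
];

// The caller's label can raise the class but can never take it below the detected floor.
function effectiveClass(payload: string, callerLabel: DataClass): DataClass {
  const floor: DataClass = PII_PATTERNS.some((re) => re.test(payload)) ? "confidential" : "public";
  return CLASS_ORDER.indexOf(callerLabel) >= CLASS_ORDER.indexOf(floor) ? callerLabel : floor;
}

// Hypothetical egress rule: only data whose effective class is "public" may leave.
function mayLeavePremises(payload: string, callerLabel: DataClass): boolean {
  return effectiveClass(payload, callerLabel) === "public";
}
```

Labelling a payload "public" therefore has no effect once PII is detected in it.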

E10 · Tool allowlist

exact, deny-by-default, signed; alias/injection denied.
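A minimal sketch of exact, deny-by-default matching. The tool names are hypothetical, and signature verification of the allowlist itself is not modelled here.

```typescript
// Hypothetical allowlist; in the source it is also signed, which this sketch omits.
const ALLOWED_TOOLS: ReadonlySet<string> = new Set(["read_file", "search_docs"]);

function isToolAllowed(requested: string): boolean {
  // Reject anything outside a strict ASCII identifier alphabet. This alone rules out
  // homoglyphs, zero-width characters, whitespace padding, and namespace prefixes.
  if (!/^[a-z0-9_]+$/.test(requested)) return false;
  // Exact match only: no trimming, case-folding, or Unicode normalisation that an
  // attacker could use to make a different string "equal" to an allowed name.
  return ALLOWED_TOOLS.has(requested);
}
```

Anything not listed is denied, so a new or renamed tool stays blocked until someone adds it deliberately.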

E11 · Incident SLA

6-hour / 72-hour notification logic + dispatch.
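The deadline arithmetic can be sketched as follows. The source does not say which notice each window belongs to, so the sketch labels them only by length; the function names are hypothetical, and dispatch is not modelled.

```typescript
const HOUR_MS = 60 * 60 * 1000;

interface NotificationDeadlines { shortWindow: Date; longWindow: Date; }

// Both windows are measured from the moment the incident was detected.
function notificationDeadlines(detectedAt: Date): NotificationDeadlines {
  return {
    shortWindow: new Date(detectedAt.getTime() + 6 * HOUR_MS),
    longWindow: new Date(detectedAt.getTime() + 72 * HOUR_MS),
  };
}

// Returns the windows that have elapsed without the matching notice being sent.
function overdueNotices(
  detectedAt: Date,
  now: Date,
  sent: { shortWindow: boolean; longWindow: boolean },
): string[] {
  const deadlines = notificationDeadlines(detectedAt);
  const overdue: string[] = [];
  if (!sent.shortWindow && now.getTime() > deadlines.shortWindow.getTime()) overdue.push("6-hour");
  if (!sent.longWindow && now.getTime() > deadlines.longWindow.getTime()) overdue.push("72-hour");
  return overdue;
}
```

Working in epoch milliseconds keeps the calculation independent of the server's local time zone.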

E13 · Egress isolation

default-deny network egress (on the on-prem server).
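A minimal sketch of validating and evaluating a default-deny ruleset. The ruleset shape and function names are assumptions for illustration; this is not what npm run ops:egress-check actually executes, which the source does not show.

```typescript
interface EgressRule { host: string; port: number; }
interface EgressRuleset { defaultPolicy: "deny" | "allow"; allow: EgressRule[]; }

// A ruleset is valid only if it denies by default and every exception is specific.
function rulesetProblems(ruleset: EgressRuleset): string[] {
  const problems: string[] = [];
  if (ruleset.defaultPolicy !== "deny") problems.push("default policy must be deny");
  for (const rule of ruleset.allow) {
    if (rule.host.includes("*")) problems.push(`wildcard host not permitted: ${rule.host}`);
    if (!Number.isInteger(rule.port) || rule.port < 1 || rule.port > 65535) {
      problems.push(`invalid port for ${rule.host}: ${rule.port}`);
    }
  }
  return problems;
}

// Fail closed: an invalid ruleset permits nothing, and only exact host:port pairs pass.
function isEgressAllowed(ruleset: EgressRuleset, host: string, port: number): boolean {
  if (rulesetProblems(ruleset).length > 0) return false;
  return ruleset.allow.some((rule) => rule.host === host && rule.port === port);
}
```

Treating an invalid ruleset as "deny everything" means a configuration mistake cannot silently open the network.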

06 / Standards

Designed in alignment with Malaysian regulatory frameworks.

Alignment by design, at control-family level — not a certification claim. Formal certification is an independent process.

Framework | Relevant alignment
BNM RMiT | access control, segregation of duties, change management, data-loss prevention, technology audit
NACSA · Cyber Security Act 2024 | cyber risk management, incident-notification readiness (E11)
MAMPU / JDN | data residency / on-prem processing (E1, E13)
PDPA 2010 + Amendment 2024 | data classification, breach-notification logic, cross-border control
ISO/IEC 27001:2022 | logging & monitoring, secure operations

07 / Validation status — the honest version

What is proven today, and what is in progress.

Proven now: 52 reproducible tests — 45 enforcement acceptance + 7 ACTION-based adversarial red-team (indirect prompt-injection → PII exfil, forged identity, tool-poisoning incl. homoglyph / namespace / zero-width, confused-deputy, classification-downgrade, residency-bypass, RBAC). Run with npm test on any machine.
Verified in code: the agent cannot disable its own guardrails; high-risk actions are mechanically blocked + human-gated.
In progress: official AgentDojo / Garak / Inspect benchmark runs against a live model endpoint (deferred until the on-prem model gateway exists). We will publish results + reproduction steps; we do not claim "passed" until the runs exist.
Scope (honest): the adversarial corpus validates that policy gates deny or sanitize prohibited actions at the enforcement boundary — it is not official AgentDojo / Garak / AgentHarm / Inspect benchmark results, and does not claim model-level jailbreak immunity. Deeper classes (multi-turn, stored second-order, TOCTOU, concurrency) are on the roadmap.
Honest limit: no system is 100% secure. ALESA NOVA offers verifiable mechanical controls, reproducible tests, and a human-in-the-loop last wall, not a "kebal" (invulnerable) certificate. That honesty is the point: it is what survives a real security review.

Don't take our word. Run the tests.

For SIRIM, government, banking, and regulated teams: request the reproducibility pack and verify ALESA NOVA's enforcement independently.