ALESA NOVA · Agent-Safety & Assurance

How we test AI agents — and what ALESA NOVA defends.

We do not claim "bullet-proof." We give you mechanical, human-in-the-loop controls and a reproducible test suite you can run yourself. Below: the recognised agent-safety toolset, what each attack class is, the ALESA mechanism that defends it, and how to verify it independently.

Summary: ALESA NOVA does not claim to be "bullet-proof". It provides mechanical security controls plus human approval, verified by 52 tests (45 enforcement + 7 adversarial red-team) that you can run yourself. This page explains the tools used to test AI agents, what each one attacks, and the ALESA mechanism that counters it. Aligned with BNM RMiT · NACSA · MAMPU · PDPA.

01 / Reproducible proof

52 reproducible tests — including an adversarial red-team. Run them yourself.

The strongest assurance is not our word — it is a test you can re-run. The enforcement layer ships with a reproducible suite. Extract the package, install, and watch every gate prove itself.

$ npm ci                    # install (no cloud dependency)
$ npm test                  # → 52 passed, 0 failed (45 enforcement + 7 adversarial red-team)
$ npm run build             # WebAdmin builds clean
$ npm run ops:egress-check  # egress ruleset = valid (default-deny)

Independently verifiable: a CISO or auditor extracts the package, runs npm test, and observes 52 passed, 0 failed (45 enforcement + 7 adversarial red-team). The package carries a SHA-256 checksum for integrity. Don't trust the claim; reproduce it.
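The integrity check mentioned above can be sketched as follows. This is a minimal illustration, assuming the published checksum is a hex-encoded SHA-256 digest of the package archive; the helper names `sha256Hex` and `matchesChecksum` are hypothetical and are not part of the ALESA package.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex-encoded SHA-256 digest of the package bytes.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

// Compare the computed digest with the published one,
// ignoring letter case and surrounding whitespace in the published value.
function matchesChecksum(data: Uint8Array | string, publishedHex: string): boolean {
  return sha256Hex(data) === publishedHex.trim().toLowerCase();
}
```

In practice the archive bytes would be read from disk (for example with `readFileSync` from `node:fs`) and passed to `matchesChecksum` together with the checksum supplied with the package.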

02 / The agent-safety toolset

The recognised tools for testing AI agents — and what each attacks.

These are the industry / research-recognised tools used to probe AI agents. We explain what each one attacks so you understand the threat model ALESA NOVA is built against.

Tool	Who	What it attacks / probes
AgentDojo	ETH Zürich (SPY Lab)	Prompt-injection against tool-using agents — can the agent be hijacked by malicious data or tool output into performing unintended actions?
Garak	NVIDIA	LLM vulnerability scanner — probes for data leakage, exfiltration, jailbreak, and prompt injection.
AgentHarm	UK AI Safety Institute / Gray Swan	Measures whether an agent can be persuaded to carry out harmful tasks (refusal vs compliance).
τ-bench	Sierra	Agent reliability & consistency in real tool-use — does it do the same correct thing, repeatably?
Inspect	UK AI Safety Institute	An evaluation framework for authoring and running custom agent evaluations, including deception / scheming tests.

03 / What ALESA NOVA prevents

Every attack class → the ALESA mechanism → the reproducible test.

This is the heart of it: for each attack the toolset probes, the specific ALESA control that defends it, and the test in our suite that proves it.

Attack class (probed by)	ALESA mechanism that defends it	Reproducible test
Prompt-injection hijack (AgentDojo)	Policy-Enforcement-Point with server-derived identity (a caller can't forge its role) + tool-allowlist (deny-by-default) + human gate on consequential actions	E2 + E10 (10 tests)
Data leakage / exfiltration (Garak)	Secret-leak gate + outbound exfil block + data-classification floor (PII can't be mislabelled) + local-only residency	E1 + E9 (10 tests)
Persuaded to harmful action (AgentHarm)	Operator governance + hard-block on sensitive actions (payment/auth/destructive) + RBAC + mandatory human approval	E4 + hard-block (6 tests)
Unreliable / inconsistent (τ-bench)	Deterministic mechanical gates + tamper-evident audit + verify-before-done	E3 (6 tests)
Tamper with the record / cover tracks	Audit external anchor — a tamper + re-hash of the ledger is detected, not silently accepted	E3 (audit-anchor)
Forged identity / privilege escalation	Identity is server-derived; any caller-supplied role/actor is ignored	E2 (PEP)
Destructive action (DB wipe, rm -rf)	Risk-floor (destructive never trivially allowed) + no-edit-without-backup + multi-layer block	E2 + E13 (egress)
Any model (Claude / Codex / DeepSeek / Qwen / local)	Model-agnostic chokepoint — the gate takes no model parameter; every model passes the identical enforcement	gateway-binding (model-agnostic test)
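To make the first row of the table concrete, here is a minimal sketch of a policy-enforcement point that combines server-derived identity, a deny-by-default tool allowlist, and a human gate. It is an illustration under stated assumptions, not ALESA's implementation: the session store, tool names, roles, and the `decide` function are all hypothetical.

```typescript
type Role = "viewer" | "operator";
type Decision = "allow" | "needs-human-approval" | "deny";

interface ToolRequest {
  tool: string;
  // Untrusted: a caller or an injected prompt may put anything here.
  claimedRole?: string;
}

// Hypothetical server-side session store: the only source of identity.
const sessions = new Map<string, Role>([
  ["sess-viewer", "viewer"],
  ["sess-operator", "operator"],
]);

// Hypothetical allowlist keyed by exact tool name.
const allowlist = new Map<string, { roles: Role[]; consequential: boolean }>([
  ["read_report", { roles: ["viewer", "operator"], consequential: false }],
  ["send_payment", { roles: ["operator"], consequential: true }],
]);

function decide(sessionId: string, req: ToolRequest): Decision {
  const role = sessions.get(sessionId); // server-derived; req.claimedRole is never read
  const entry = allowlist.get(req.tool); // exact match only
  if (role === undefined || entry === undefined) return "deny"; // deny by default
  if (!entry.roles.includes(role)) return "deny";
  return entry.consequential ? "needs-human-approval" : "allow";
}
```

In this sketch a viewer session that claims `claimedRole: "admin"` is still treated as a viewer, an unlisted tool is denied outright, and even a permitted consequential tool stops at the human gate rather than executing.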

04 / Three guarantees that are rare

What makes ALESA NOVA different — verified in code.

Model-agnostic

Every model obeys the same gate

Enforcement lives outside the model. Whether the agent is Claude, Codex, DeepSeek, Qwen, or a local model, it hits the identical mechanical gate. A foreign, compromised, or injected model is constrained the same way — because we do not rely on the model being well-behaved.
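One way to picture a gate that "takes no model parameter" is the sketch below. It is illustrative only: the `gate` and `dispatch` functions and the tool names are hypothetical, and the real gate applies far more checks than a single allowlist lookup.

```typescript
interface ProposedAction {
  tool: string;
  args: Record<string, string>;
}

// Hypothetical allowlist used only for this illustration.
const permittedTools = new Set<string>(["search_docs", "read_ticket"]);

// The gate sees only the proposed action. There is deliberately no model
// parameter, so no model can receive a more lenient decision than another.
function gate(action: ProposedAction): boolean {
  return permittedTools.has(action.tool);
}

// Hypothetical dispatcher: the model name is used for labelling the result,
// and is never passed to the gate.
function dispatch(model: string, action: ProposedAction): string {
  return gate(action)
    ? `${model}: executed ${action.tool}`
    : `${model}: blocked ${action.tool}`;
}
```

Because the decision depends only on the action, calling `dispatch` with any model name yields the same allow or block outcome for the same action.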

Anti-self-override

The agent cannot disable its own guardrails

An override requires an out-of-band, password-verified action by a human. The gate, doctrine, and settings files are edit-denied to the agent, and the agent holds no password. This is the first question every CISO asks — and it is verified in the code.
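The edit-deny rule can be sketched as a path check that runs outside the agent. This is a simplified illustration, assuming the actor is derived server-side and that the out-of-band password verification has already produced a boolean result; the directory names and the `canEdit` function are hypothetical.

```typescript
type Actor = "agent" | "human";

// Hypothetical protected locations; the real file layout is not given on this page.
const protectedDirs = ["gate", "doctrine", "settings"];

// Resolve "." and ".." so "notes/../gate/x" cannot slip past the prefix check.
function normalize(path: string): string {
  const parts: string[] = [];
  for (const part of path.split("/")) {
    if (part === "" || part === ".") continue;
    if (part === "..") parts.pop();
    else parts.push(part);
  }
  return parts.join("/");
}

function canEdit(actor: Actor, path: string, humanOverrideVerified: boolean): boolean {
  const clean = normalize(path);
  const isProtected = protectedDirs.some((d) => clean === d || clean.startsWith(d + "/"));
  if (!isProtected) return true;
  // Protected files: the agent is always denied, whatever the override flag says.
  return actor === "human" && humanOverrideVerified;
}
```

The sketch handles only relative path segments; symlinks and case-insensitive filesystems would need additional handling in a real deployment.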

Provable governance · new

Every answer carries a receipt you can verify yourself

NOVA On-Prem signs every answer with an Ed25519 receipt — verifiable offline with the public key alone. Attack-success rate is measured and sealed into a tamper-evident audit chain; the system red-teams itself on a schedule; and a regulator evidence pack exports straight from the chain. 319 reproducible tests on the on-prem oracle. Proof, not promises.
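Offline verification with the public key alone can be illustrated with Node's built-in Ed25519 support. This is a sketch under assumptions: the receipt shape (`answer` plus a base64 `signature`) and the function names are hypothetical, and the actual NOVA On-Prem receipt format is not described on this page.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Hypothetical receipt shape: the answer text plus a base64 Ed25519 signature.
interface Receipt {
  answer: string;
  signature: string;
}

// Signing needs the private key, which would stay on the server.
function issueReceipt(answer: string, privateKey: KeyObject): Receipt {
  // For Ed25519 the algorithm argument must be null.
  const sig = sign(null, Buffer.from(answer, "utf8"), privateKey);
  return { answer, signature: sig.toString("base64") };
}

// Verification needs only the public key and works without network access.
function verifyReceipt(receipt: Receipt, publicKey: KeyObject): boolean {
  return verify(
    null,
    Buffer.from(receipt.answer, "utf8"),
    publicKey,
    Buffer.from(receipt.signature, "base64"),
  );
}

// Demo key pair generated on the spot, for illustration only.
const demoKeys = generateKeyPairSync("ed25519");
```

A receipt issued with `demoKeys.privateKey` verifies against `demoKeys.publicKey`, and changing a single character of the answer makes verification fail.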

05 / The enforcement spine

Eight mechanical controls, built test-first.

E1 · Residency

local-only model routing; sensitive data never leaves the premises.

E2 · Policy gate

server-derived identity, destructive risk-floor, RBAC.

E3 · Audit anchor

tamper + re-hash of the ledger is detected.

E4 · Identity

MFA, segregation of duties, session idle-timeout.

E9 · Classification

the presence of PII sets a floor on the data class, so a mislabelled record cannot be exfiltrated.

E10 · Tool allowlist

exact, deny-by-default, signed; alias/injection denied.

E11 · Incident SLA

6-hour / 72-hour notification logic + dispatch.

E13 · Egress isolation

default-deny network egress (on the on-prem server).
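To illustrate why E3 relies on an external anchor, the sketch below shows a hash-chained ledger in which an attacker who edits an entry and re-hashes the whole chain still fails the anchor check. It is a simplified model under assumptions: the entry format, the "genesis" seed, and all function names are hypothetical, and the anchor is represented as a head hash stored somewhere the attacker cannot write.

```typescript
import { createHash } from "node:crypto";

interface Entry {
  event: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "genesis";

function hashOf(prevHash: string, event: string): string {
  return createHash("sha256").update(prevHash + "\n" + event).digest("hex");
}

// Hash of the last entry, or the genesis seed for an empty ledger.
function headOf(ledger: Entry[]): string {
  let head = GENESIS;
  for (const e of ledger) head = e.hash;
  return head;
}

function append(ledger: Entry[], event: string): Entry[] {
  const prevHash = headOf(ledger);
  return [...ledger, { event, prevHash, hash: hashOf(prevHash, event) }];
}

// Internal consistency: catches an edit that was not re-hashed.
function chainIsConsistent(ledger: Entry[]): boolean {
  let prev = GENESIS;
  for (const e of ledger) {
    if (e.prevHash !== prev || e.hash !== hashOf(prev, e.event)) return false;
    prev = e.hash;
  }
  return true;
}

// External anchor: catches an edit followed by a full re-hash,
// because the rebuilt head no longer equals the anchored head.
function matchesAnchor(ledger: Entry[], anchoredHead: string): boolean {
  return chainIsConsistent(ledger) && headOf(ledger) === anchoredHead;
}
```

A chain that is rebuilt from altered events is internally consistent, so `chainIsConsistent` alone would accept it; only the comparison against the externally stored head exposes the rewrite.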

06 / Standards

Designed in alignment with Malaysian regulatory frameworks.

Alignment by design, at control-family level — not a certification claim. Formal certification is an independent process.

Framework	Relevant alignment
BNM RMiT	access control, segregation of duties, change management, data-loss prevention, technology audit
NACSA · Cyber Security Act 2024	cyber risk management, incident-notification readiness (E11)
MAMPU / JDN	data residency / on-prem processing (E1, E13)
PDPA 2010 + Amendment 2024	data classification, breach-notification logic, cross-border control
ISO/IEC 27001:2022	logging & monitoring, secure operations

07 / Validation status — the honest version

What is proven today, and what is in progress.

✓ Proven now: 52 reproducible tests, made up of 45 enforcement acceptance tests and 7 ACTION-based adversarial red-team tests (indirect prompt-injection → PII exfil, forged identity, tool-poisoning incl. homoglyph / namespace / zero-width, confused-deputy, classification-downgrade, residency-bypass, RBAC). Run with npm test on any machine.

✓ Verified in code: the agent cannot disable its own guardrails; high-risk actions are mechanically blocked + human-gated.

◔ In progress: official AgentDojo / Garak / Inspect benchmark runs against a live model endpoint (deferred until the on-prem model gateway exists). We will publish results + reproduction steps; we do not claim "passed" until the runs exist.

Scope (honest): the adversarial corpus validates that policy gates deny or sanitize prohibited actions at the enforcement boundary — it is not official AgentDojo / Garak / AgentHarm / Inspect benchmark results, and does not claim model-level jailbreak immunity. Deeper classes (multi-turn, stored second-order, TOCTOU, concurrency) are on the roadmap.

Honest limit: no system is 100% secure. ALESA NOVA consists of verifiable mechanical controls, reproducible tests, and a human-in-the-loop last line of defence; it is not a "bullet-proof" certificate. That honesty is the point: it is what survives a real security review.

Don't take our word. Run the tests.

For SIRIM, government, banking, and regulated teams: request the reproducibility pack and verify ALESA NOVA's enforcement independently.
