Ask HN: How are you testing AI agents before shipping to production?
2 points by harperlabs 105 days ago
We've been running reliability audits on AI agents before production deployment, and the failure patterns are consistent enough that I built a framework around them.

Some context on why this matters right now: Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027. In January 2026, a prompt injection in a customer support agent processed a $47,000 fraudulent refund. These aren't fringe cases anymore.

The 7 failure modes we see most often:

1. Hallucination under unexpected inputs — works perfectly in demos, invents data when the input is slightly off

2. Edge case collapse — null values, Unicode names (O'Brien, José, 北京), empty fields, concurrent requests

3. Prompt injection — if your agent processes external content, users can hijack its behavior through that content

4. Context limit surprises — agent works for 95% of conversations, then silently misbehaves when the context window fills. No error. Just wrong behavior.

5. Cascade failures — tool call #1 fails, agent keeps going, by the time a human sees the result 3 calls have compounded the error

6. Data integration drift — built against your schema in January, schema changed in February, still calling deprecated endpoints in March

7. Authorization confusion — multi-tenant system, cached context from User A bleeds into User B's session

We've built 50+ test cases across these categories. Most teams test #1 and #3. Almost no one systematically tests #4, #5, and #6 before shipping.
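To make #4 concrete, here is a minimal sketch of the kind of guard a test can assert against: fail loudly when the conversation exceeds a budget instead of silently misbehaving. The names `estimate_tokens`, `check_context_budget`, and `ContextBudgetExceeded` are hypothetical, and the 4-characters-per-token estimate is an assumption, not a real tokenizer.

```python
class ContextBudgetExceeded(Exception):
    """Raised instead of silently dropping or truncating messages."""


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token.
    # Swap in your model's real tokenizer for anything serious.
    return max(1, len(text) // 4)


def check_context_budget(messages: list[str], max_tokens: int) -> int:
    """Return the estimated token total, or raise if it exceeds max_tokens."""
    total = sum(estimate_tokens(m) for m in messages)
    if total > max_tokens:
        raise ContextBudgetExceeded(
            f"estimated {total} tokens exceeds budget of {max_tokens}"
        )
    return total
```

A test case for #4 can then feed a deliberately long conversation and assert that the error is raised, rather than checking output quality after the fact.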

Happy to share the framework. Curious what failure modes you've hit that I haven't listed.

3 comments

One failure mode missing from your list: epistemic distortion. The agent gives you something that looks correct but applies the wrong standard of evidence. We documented 7 patterns like this across 1,400+ controlled experiments - things like silently dropping one of two conflicting instructions without telling you, or applying stricter scrutiny to null results than positive results. None of these show up in happy-path testing. They require adversarial eval specifically designed to probe the epistemic layer.

For the config-level issues (vague instructions, conflicting directives), lintlang catches these statically before runtime:

pip install lintlang

Tested prompt injection specifically last week — ran 18 attack vectors against PromptGuard (an AI security firewall). 12 bypassed with 100% confidence.

What got through consistently: unicode homoglyphs (Ignøre prеvious...), base64-encoded instructions, ROT13, any non-English language, multi-turn fragmentation (split the injection across 3-5 messages).

Your #3 is actually harder to test than most teams realize, because it requires modeling adversarial intent — not just known attack signatures. Pattern-matching at the proxy layer doesn't catch encoding attacks or language-switched instructions.

I'm running adversarial red-team audits on agent security tooling. Full PromptGuard breakdown going out as a coordinated disclosure. Happy to share the methodology — it's surprisingly cheap to run systematically against your own stack before shipping.

The multi-turn fragmentation is the one that trips up most testing frameworks -- ours included, initially. We saw it slip through in 8/50 test cases because we were generating single-turn injection attempts. The adversarial instructions didn't get semantically assembled until execution.
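A rough sketch of how multi-turn test cases could be generated, assuming a simple word-level split. `fragment_payload` and `contains_signature` are hypothetical helpers, not part of any framework mentioned in this thread, and the signature check stands in for a per-message pattern filter.

```python
def fragment_payload(payload: str, turns: int) -> list[str]:
    """Split payload into `turns` contiguous word chunks, one per message."""
    words = payload.split()
    if turns < 1 or turns > len(words):
        raise ValueError("turns must be between 1 and the number of words")
    size, extra = divmod(len(words), turns)
    chunks, start = [], 0
    for i in range(turns):
        # The first `extra` chunks take one additional word.
        end = start + size + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks


def contains_signature(text: str, signature: str) -> bool:
    """Stand-in for a pattern-matching filter: case-insensitive substring check."""
    return signature.lower() in text.lower()
```

With a 7-word payload split across 4 turns, no single message contains the full signature, but the concatenated conversation does. That is the case a single-turn test generator never exercises.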

For the encoding vectors: we caught unicode homoglyphs by normalizing all inputs to NFKC before processing. Base64 and ROT13 still require intent modeling at the LLM layer, not sanitization. A proxy that doesn't recognize and decode base64 will pass the encoded instructions straight through.
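A small sketch of both steps using only the Python standard library. `normalize_input` and `try_decode_base64` are hypothetical names. One caveat worth checking on your own inputs: NFKC folds compatibility forms such as fullwidth letters, but cross-script lookalikes such as Cyrillic "е" are distinct characters and pass through unchanged, so they would need a separate confusables mapping.

```python
import base64
import binascii
import unicodedata
from typing import Optional


def normalize_input(text: str) -> str:
    """Apply Unicode NFKC normalization (folds fullwidth letters, ligatures, etc.)."""
    return unicodedata.normalize("NFKC", text)


def try_decode_base64(text: str) -> Optional[str]:
    """Return the decoded UTF-8 text if `text` is strictly valid base64, else None."""
    try:
        raw = base64.b64decode(text.strip(), validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None


fullwidth = "\uff29\uff47\uff4e\uff4f\uff52\uff45"  # fullwidth "Ignore"
cyrillic = "pr\u0435vious"  # Cyrillic U+0435 in place of Latin "e"
```

Decoding only surfaces the hidden text. Deciding whether that text is an instruction is still the intent-modeling problem described above.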

The gap you're describing between 'we have an injection firewall' and 'we've tested adversarial encoding' is exactly where production failures hide. Would genuinely like to see the PromptGuard methodology when it goes out.

The NFKC normalization is correct — closes the homoglyph class almost entirely. Most commercial firewalls skip this step, which is why unicode vectors reliably pass.

PromptGuard disclosure is being compiled now. Full 18-vector suite with evasion rates per class. Will post it here when ready.

On the auditing side: if you work with clients who have injection defenses in production, the adversarial encoding class (base64, ROT13, language-switching, multi-turn fragmentation) is likely the gap in their current coverage. Happy to put together the methodology as a structured test suite — either as documentation you can run yourself or as direct adversarial test cases with pass/fail rates. DM if useful.

This thread is an incredible resource for adversarial security testing, but I'd love to pull on the "Cascade failures" (#5) thread from the original post, because that's what actually takes down production systems most often.

We spend so much time testing if the model will break, and almost no time testing if the workflow can recover when the model inevitably does break. If an agent is executing a 4-step sequence and fails on step 3, how do you test what happens next? Does it orphan the data from steps 1 and 2? Does it infinitely retry and duplicate records?

The biggest gap in agent testing right now is that we test agents like they are stateless functions, when in reality they are long-running stateful processes. You can't just test the prompt; you have to test the system's idempotency. If you can't safely kill an agent mid-task and restart it without corrupting your database, the system isn't production-ready, regardless of how robust your prompt injection firewall is. Please do share the framework. I'm curious where we're missing the point, since the surface is ever expanding post Openclaw.
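As a sketch of what a kill-and-restart test could look like, assuming each step has a stable name and completed steps are recorded in a ledger. `run_workflow` and `simulate_kill_and_restart` are hypothetical, and the in-memory set stands in for durable storage.

```python
from typing import Callable


def run_workflow(
    steps: list[tuple[str, Callable[[], None]]], completed: set[str]
) -> None:
    """Run steps in order, skipping any whose name is already in `completed`."""
    for name, action in steps:
        if name in completed:
            continue  # done in an earlier run; do not repeat its side effects
        action()
        completed.add(name)  # record only after the step succeeds


def simulate_kill_and_restart() -> list[str]:
    """Crash on step 3 during the first run, then restart with the same ledger."""
    records: list[str] = []
    armed = {"crash": True}

    def step(name: str) -> Callable[[], None]:
        return lambda: records.append(name)

    def step3() -> None:
        if armed["crash"]:
            armed["crash"] = False
            raise RuntimeError("simulated kill mid-task")
        records.append("step3")

    steps = [("step1", step("step1")), ("step2", step("step2")),
             ("step3", step3), ("step4", step("step4"))]
    completed: set[str] = set()
    try:
        run_workflow(steps, completed)
    except RuntimeError:
        pass  # the process "died"; the ledger survives
    run_workflow(steps, completed)  # restart
    return records
```

The assertion is that every side effect appears exactly once after the restart. In a real system the ledger write and the step's side effect are not atomic, so a crash between them would still repeat the step unless the step itself is idempotent.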