Show HN: Director-AI – token-level NLI+RAG



2 points by anulum 111 days ago

Hey HN,

After watching too many agents confidently lie in production, I built Director-AI.

It sits between your LLM and the user, scoring every generated token with:

• 0.6× DeBERTa-v3 NLI (contradiction detection)

• 0.4× RAG against your own ChromaDB knowledge base

If coherence < threshold → Rust kernel halts the stream before the token is sent.
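For readers who want the arithmetic spelled out, here is a minimal sketch of the weighted score and the halt check. Only the 0.6/0.4 weights and the "coherence < threshold" rule come from the post; the function names, the assumption that both inputs lie in [0, 1] with higher meaning more consistent, and the 0.5 default threshold are hypothetical and are not taken from the Director-AI codebase.

```python
NLI_WEIGHT = 0.6  # weight stated in the post
RAG_WEIGHT = 0.4  # weight stated in the post


def coherence(nli_score: float, rag_score: float) -> float:
    """Blend the two signals into one score.

    Assumption: both inputs are in [0, 1], where higher means
    "more consistent with the evidence".
    """
    return NLI_WEIGHT * nli_score + RAG_WEIGHT * rag_score


def should_halt(nli_score: float, rag_score: float, threshold: float = 0.5) -> bool:
    """Return True when the blended score falls below the threshold.

    The 0.5 default is a placeholder, not the project's actual setting.
    """
    return coherence(nli_score, rag_score) < threshold
```

With these weights, a strong NLI signal cannot fully mask a weak retrieval signal, and vice versa: an NLI score of 0.2 with a RAG score of 0.3 blends to 0.24 and would trip a 0.5 threshold.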

Key technical bits:

• Works with any OpenAI-compatible endpoint (Ollama, vLLM, llama.cpp, Groq, OpenAI, Claude…)

• StreamingKernel + windowed scoring

• GroundTruthStore.add() for easy fact ingestion

• Dual licensing: AGPL open + commercial (closed-source/SaaS OK)

Honest AggreFact numbers inside (66.2% balanced acc with streaming enabled). Not claiming SOTA on static NLI — the value is in the live gating + custom KB system.

Repo + full examples: https://github.com/anulum/director-ai

Would love feedback on the scoring weights, halt logic, or kernel design. What hallucination problems are you solving today?

2 comments

anulum 109 days ago

Hey HN — huge thanks for the thoughtful comments yesterday!

I shipped *v1.2.0* overnight with everything you asked for:

• Full end-to-end benchmark notebook (600+ real RAG/agent traces, HaluEval + TruthfulQA, head-to-head vs Claude self-critique, latency, false positives, recovery rate) → notebooks/04_end_to_end_benchmark.ipynb

• Rich evidence on every halt: top-K conflicting chunks + highlighted NLI premise/hypothesis + distances (now in HaltEvent + dashboard)

• Ready-made graceful fallbacks (soft warning, retrieval-only retry, partial+correction) → examples/graceful_fallbacks.py

• Live Hugging Face Spaces demo (try it yourself): https://huggingface.co/spaces/anulum/director-ai-guardrail

• Full MkDocs site (22 pages), native OpenAI/Anthropic interceptors, score caching, 8-bit NLI, bge-large, LangGraph/Haystack/CrewAI support
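As a rough illustration of the three fallback patterns named above (soft warning, retrieval-only retry, partial+correction), here is one possible reading of them. This is not the content of examples/graceful_fallbacks.py; the enum, the function, and the message wording are all hypothetical.

```python
from enum import Enum


class Fallback(Enum):
    # Names mirror the three patterns listed in the comment; values are illustrative.
    SOFT_WARNING = "soft_warning"
    RETRIEVAL_ONLY_RETRY = "retrieval_only_retry"
    PARTIAL_WITH_CORRECTION = "partial_with_correction"


def apply_fallback(strategy: Fallback, partial_text: str, top_fact: str) -> str:
    """Build the user-facing text after a halt.

    partial_text: tokens already released before the halt.
    top_fact: the best-matching knowledge-base entry (assumed to be available).
    """
    if strategy is Fallback.SOFT_WARNING:
        # Keep what was sent and flag it.
        return partial_text + "\n[Warning: this answer may conflict with known facts.]"
    if strategy is Fallback.RETRIEVAL_ONLY_RETRY:
        # Discard the generation and answer from the knowledge base alone.
        return top_fact
    if strategy is Fallback.PARTIAL_WITH_CORRECTION:
        # Keep the partial answer and append the conflicting fact as a correction.
        return f"{partial_text}\n[Correction: {top_fact}]"
    raise ValueError(f"unknown strategy: {strategy!r}")
```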

Repo: https://github.com/anulum/director-ai

Changelog: https://github.com/anulum/director-ai/releases/tag/v1.2.0

Would love feedback on the new bits — especially the end-to-end numbers and graceful patterns. Fire away!


anulum 96 days ago

@soletta — you're right, and I've deferred this three times now, which isn't useful. Here's the honest answer: that specific benchmark — frontier LLM alone vs frontier + Director-AI on end-to-end hallucination rate in streaming — doesn't exist yet. I'll have it in the repo within the week, open methodology, raw logs included. But I'll also say clearly: if you're already running Claude Opus 4.6 or GPT-4o, Director-AI probably adds marginal value on self-consistency. Frontier models in 2025/2026 are remarkably coherent within a single response. Where it matters:

- You're running Llama-3.1-70B or a local vLLM stack and can't afford $15/M tokens for a judge call

- You need a hard stop with an audit trail (regulatory, medical, legal), not a probabilistic nudge

- Your facts live in a private KB that can't go in a context window

- You need deterministic, reproducible decisions in prod

I'll run the frontier comparison this week and post results here regardless of how they look.

Kind regards, Miroslav


anulum 95 days ago

https://anulum.github.io/director-ai/benchmarks/


soletta 111 days ago

Sounds interesting. What makes DeBERTa + RAG any better at detecting contradictions in the context than a frontier LLM, and why? I see that the NLI scorer itself was evaluated, but I’d love to see data about how the full system performs vs SotA if you have any on hand.


anulum 111 days ago

@soletta Great question — this is exactly why we built it this way.

*Short answer*: frontier LLMs are excellent at static self-critique, but terrible for *real-time token-by-token streaming guardrails* because of latency, cost, and lack of persistent custom memory.

*Why DeBERTa + RAG wins here*:

- *Latency*: DeBERTa-v3-base + Rust kernel scores every ~4 tokens in ~220 ms (AggreFact eval). A frontier LLM call (GPT-4o/Claude 3.5) is 400–2000 ms per check. You can’t do that mid-stream without killing UX.

- *Cost*: Frontier self-checking at scale = real money. This runs fully local/offline after the one-time model download.

- *Custom knowledge*: The 0.4× RAG weight pulls from your GroundTruthStore (ChromaDB). Frontier models don’t have a live, updatable external fact base unless you keep stuffing context (expensive + context-window limited).

- *Determinism & auditability*: Small fine-tunable NLI model + fixed vector DB = reproducible decisions. LLMs-as-judges are stochastic and hard to debug in prod.
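To make "scores every ~4 tokens" concrete, here is a small Python sketch of a windowed streaming gate. It is an illustration only, not the Rust StreamingKernel: `gated_stream`, `score_fn`, and the `stride`, `window`, and `threshold` defaults are hypothetical names and values. The idea it shows is that tokens are buffered in groups and released only after the group has been scored, so a failing group never reaches the user.

```python
from typing import Callable, Iterable, Iterator, List


def gated_stream(
    tokens: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
    stride: int = 4,
    window: int = 16,
) -> Iterator[str]:
    """Yield tokens, holding back each group of `stride` tokens until it is scored."""
    history: List[str] = []  # tokens already released to the user
    pending: List[str] = []  # tokens generated but not yet released
    for tok in tokens:
        pending.append(tok)
        if len(pending) < stride:
            continue
        # Score the most recent `window` tokens, including the pending group.
        context = (history + pending)[-window:]
        if score_fn("".join(context)) < threshold:
            return  # halt: the pending tokens are never sent
        yield from pending
        history.extend(pending)
        pending = []
    # Score whatever is left when the stream ends mid-group.
    if pending:
        context = (history + pending)[-window:]
        if score_fn("".join(context)) >= threshold:
            yield from pending
```

The trade-off in this sketch is that a larger `stride` means fewer scorer calls but more buffered tokens, so the user sees output in slightly larger bursts.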

We’re completely transparent: the NLI scorer alone is *not SOTA* (66.2% balanced acc on LLM-AggreFact 29k samples — see full table vs MiniCheck/Bespoke/HHEM in the repo). The value is the live system: NLI + user KB + actual streaming halt that no one else ships today.
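For anyone unfamiliar with the metric: balanced accuracy is the mean of per-class recall, so for a two-class task a scorer that always predicts one class gets 0.5 regardless of class imbalance. A minimal pure-Python sketch of the standard definition (this is not the project's evaluation script, and the toy labels are made up for illustration):

```python
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls: List[float] = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: 8 supported claims, 2 unsupported, and a scorer that always says "supported".
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.8
print(plain_accuracy, balanced_accuracy(y_true, y_pred))  # 0.8 0.5
```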

Full end-to-end comparisons vs. LLM-as-judge in streaming setups are next on the roadmap (happy to run them on any dataset you care about).

Have you tried frontier self-critique in real streaming agents? What broke for you?

Repo benchmarks: https://github.com/anulum/director-ai#benchmarks


soletta 110 days ago

I should have been clearer. I'm not talking about making a separate call to the model to ask it to check itself. Any given model essentially is already watching for contradictions all the time as it is generating its output tokens. Frontier models like Claude Opus 4.6 are already exceptionally good at not contradicting themselves as they go. As for not having an external fact base - you could in principle insert content ephemerally into the context that is relevant to the task at hand, though doing this without killing modern prompt caching schemes is challenging.

I saw your benchmarks. What I was asking for is benchmarks of the full system (LLM + the NLI model) vs a frontier LLM on its own. It's fine if you didn't do them, but I think it hurts your case.


anulum 110 days ago

@soletta Got it — thanks for the extra clarity, that’s an important distinction.

You’re absolutely right: modern frontier models (Claude 3.5/Opus-class, GPT-4o, etc.) have become extremely good at maintaining internal consistency during autoregressive generation. They rarely contradict themselves within the same response anymore.

Where Director-AI adds unique value is *external grounding + hard enforcement* against a user-owned, persistent knowledge base:

- Your GroundTruthStore (ChromaDB) can be arbitrarily large, versioned, and updated without blowing up context windows or breaking prompt caching.

- The guardrail gives a *hard token-level halt* (Rust kernel severs the stream) instead of “hoping” the model self-corrects in the next few tokens.

- You get full audit logs: exact NLI score + which facts conflicted.

- It lets you pair strong-but-cheaper models (Llama-3.1-70B, Mixtral, local vLLM setups) with enterprise-grade factual reliability.
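As a sketch of what an audit entry of that kind ("exact NLI score + which facts conflicted") might contain, here is a hypothetical record type. The class and field names are invented for illustration and do not describe the library's actual HaltEvent; the point is only that a halt can be serialized deterministically for later review.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class ConflictingFact:
    fact_id: str     # hypothetical knowledge-base key
    text: str        # the stored fact that conflicted with the output
    distance: float  # vector distance reported by retrieval


@dataclass(frozen=True)
class HaltRecord:
    token_index: int  # position in the stream where the halt happened
    nli_score: float
    rag_score: float
    coherence: float
    threshold: float
    conflicts: List[ConflictingFact] = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys gives a stable field order, so identical halts log identically.
        return json.dumps(asdict(self), sort_keys=True)


record = HaltRecord(
    token_index=42,
    nli_score=0.2,
    rag_score=0.3,
    coherence=0.24,
    threshold=0.5,
    conflicts=[ConflictingFact("kb-17", "Example stored fact.", 0.12)],
)
print(record.to_json())
```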

You’re also correct that we don’t have published head-to-head numbers yet for “frontier LLM alone vs. frontier LLM + Director-AI” on end-to-end hallucination rate in streaming scenarios. The current benchmarks focus on the guardrail component itself (66.2% balanced acc on LLM-AggreFact 29k samples, with full per-dataset breakdown and comparison table vs MiniCheck/Bespoke/HHEM — see README).

That full-system eval is literally next on the roadmap (we’re setting up the scripts this week). If you have a specific domain/dataset where you’d like to see the comparison run, I’d be genuinely happy to do it publicly and share the raw logs/results.

In the meantime, the repo is 100% open (AGPL) — feel free to fork and run your own tests. Would love to hear what you find.

Benchmarks section: https://github.com/anulum/director-ai#benchmarks
