by soletta
115 days ago
I should have been clearer. I'm not talking about making a separate call to the model to ask it to check itself. Any given model is essentially already watching for contradictions all the time as it generates its output tokens. Frontier models like Claude Opus 4.6 are already exceptionally good at not contradicting themselves as they go. As for not having an external fact base: you could in principle ephemerally insert content relevant to the task at hand into the context, though doing this without killing modern prompt caching schemes is challenging. I saw your benchmarks; what I was asking for is benchmarks of the full system (LLM + the NLI model) vs. a frontier LLM on its own. It's fine if you didn't do them, but I think it hurts your case.
|
You’re absolutely right: modern frontier models (Claude 3.5/Opus-class, GPT-4o, etc.) have become extremely good at maintaining internal consistency during autoregressive generation. They rarely contradict themselves within the same response anymore.
Where Director-AI adds unique value is *external grounding + hard enforcement* against a user-owned, persistent knowledge base:
- Your GroundTruthStore (ChromaDB) can be arbitrarily large, versioned, and updated without blowing up context windows or breaking prompt caching.
- The guardrail gives a *hard token-level halt* (Rust kernel severs the stream) instead of “hoping” the model self-corrects in the next few tokens.
- You get full audit logs: exact NLI score + which facts conflicted.
- It lets you pair strong-but-cheaper models (Llama-3.1-70B, Mixtral, local vLLM setups) with enterprise-grade factual reliability.
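To make the halt-and-audit idea above concrete, here is a minimal Python sketch. It is not Director-AI's actual API: `run_guarded`, `GuardResult`, and `toy_score` are hypothetical names, the fact store is a plain list instead of ChromaDB, and `toy_score` is a crude regex stand-in for a real NLI model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class GuardResult:
    text: str                      # text emitted before any halt
    halted: bool                   # True if the stream was cut
    conflicts: List[Tuple[str, float]] = field(default_factory=list)  # audit log


def run_guarded(
    tokens: Iterable[str],
    facts: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.5,
) -> GuardResult:
    """Emit tokens until the running text conflicts with a stored fact."""
    emitted: List[str] = []
    for tok in tokens:
        candidate = "".join(emitted) + tok
        # Score the candidate text against every fact; keep those over threshold.
        conflicts = []
        for fact in facts:
            score = score_fn(candidate, fact)
            if score >= threshold:
                conflicts.append((fact, score))
        if conflicts:
            # Hard halt: the offending token is never emitted.
            return GuardResult("".join(emitted), True, conflicts)
        emitted.append(tok)
    return GuardResult("".join(emitted), False)


def toy_score(text: str, fact: str) -> float:
    """Hypothetical scorer: flags a day count that differs from the fact's."""
    fact_days = set(re.findall(r"(\d+) days", fact))
    text_days = set(re.findall(r"(\d+) days", text))
    if fact_days and text_days and not text_days <= fact_days:
        return 1.0
    return 0.0


facts = ["The refund window is 30 days."]
bad = run_guarded(["Refunds ", "within ", "90 ", "days", "."], facts, toy_score)
good = run_guarded(["Refunds ", "within ", "30 ", "days", "."], facts, toy_score)
```

In this toy run, `bad` halts at the token that completes "90 days" and records the conflicting fact with its score, while `good` streams through untouched. A real setup would retrieve only the relevant facts rather than scoring against all of them on every token.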
You’re also correct that we don’t have published head-to-head numbers yet for “frontier LLM alone vs. frontier LLM + Director-AI” on end-to-end hallucination rate in streaming scenarios. The current benchmarks cover only the guardrail component itself (66.2% balanced accuracy on 29k LLM-AggreFact samples, with a full per-dataset breakdown and a comparison table vs. MiniCheck/Bespoke/HHEM; see the README).
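For anyone unfamiliar with the metric: balanced accuracy is the mean of per-class recall, so a classifier that always predicts the majority class scores 50% on a binary task no matter how skewed the labels are. A small Python sketch follows; the labels are made up for illustration and are not taken from LLM-AggreFact.

```python
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    recalls: List[float] = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Illustrative data: 8 "supported" (1) and 2 "unsupported" (0) claims.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
always_supported = [1] * 10

plain_acc = sum(t == p for t, p in zip(y_true, always_supported)) / len(y_true)
bal_acc = balanced_accuracy(y_true, always_supported)
```

Here plain accuracy is 0.8 but balanced accuracy is 0.5, which is why the balanced figure is the more honest one to compare across datasets with different label ratios.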
That full-system eval is literally next on the roadmap (we’re setting up the scripts this week). If you have a specific domain/dataset where you’d like to see the comparison run, I’d be genuinely happy to do it publicly and share the raw logs/results.
In the meantime, the repo is 100% open (AGPL) — feel free to fork and run your own tests. Would love to hear what you find.
Benchmarks section: https://github.com/anulum/director-ai#benchmarks