| @soletta Got it — thanks for the extra clarity, that’s an important distinction. You’re absolutely right: modern frontier models (Claude 3.5/Opus-class, GPT-4o, etc.) have become extremely good at maintaining internal consistency during autoregressive generation. They rarely contradict themselves within the same response anymore. Where Director-AI adds unique value is *external grounding + hard enforcement* against a user-owned, persistent knowledge base: - Your GroundTruthStore (ChromaDB) can be arbitrarily large, versioned, and updated without blowing up context windows or breaking prompt caching.
- The guardrail gives a *hard token-level halt* (Rust kernel severs the stream) instead of “hoping” the model self-corrects in the next few tokens.
- You get full audit logs: exact NLI score + which facts conflicted.
- It lets you pair strong-but-cheaper models (Llama-3.1-70B, Mixtral, local vLLM setups) with enterprise-grade factual reliability. You’re also correct that we don’t have published head-to-head numbers yet for “frontier LLM alone vs. frontier LLM + Director-AI” on end-to-end hallucination rate in streaming scenarios. The current benchmarks focus on the guardrail component itself (66.2% balanced acc on LLM-AggreFact 29k samples, with full per-dataset breakdown and comparison table vs MiniCheck/Bespoke/HHEM — see README). That full-system eval is literally next on the roadmap (we’re setting up the scripts this week). If you have a specific domain/dataset where you’d like to see the comparison run, I’d be genuinely happy to do it publicly and share the raw logs/results. In the meantime, the repo is 100% open (AGPL) — feel free to fork and run your own tests. Would love to hear what you find. Benchmarks section: https://github.com/anulum/director-ai#benchmarks |