by anulum 110 days ago
Hey HN — huge thanks for the thoughtful comments yesterday!

I shipped *v1.2.0* overnight with everything you asked for:

• Full end-to-end benchmark notebook (600+ real RAG/agent traces, HaluEval + TruthfulQA, head-to-head vs Claude self-critique, latency, false positives, recovery rate) → notebooks/04_end_to_end_benchmark.ipynb

• Rich evidence on every halt: top-K conflicting chunks + highlighted NLI premise/hypothesis + distances (now in HaltEvent + dashboard)

• Ready-made graceful fallbacks (soft warning, retrieval-only retry, partial+correction) → examples/graceful_fallbacks.py

• Live Hugging Face Spaces demo (try it yourself): https://huggingface.co/spaces/anulum/director-ai-guardrail

• Full MkDocs site (22 pages), native OpenAI/Anthropic interceptors, score caching, 8-bit NLI, bge-large, LangGraph/Haystack/CrewAI support
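A minimal sketch of the three fallback patterns named above (soft warning, retrieval-only retry, partial+correction). This is not the contents of examples/graceful_fallbacks.py and not Director-AI's API: `Halt`, `Fallback`, and `apply_fallback` are hypothetical names for illustration, and reading "retrieval-only retry" as "answer from the retrieved chunks alone" is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Fallback(Enum):
    SOFT_WARNING = "soft_warning"
    RETRIEVAL_ONLY_RETRY = "retrieval_only_retry"
    PARTIAL_WITH_CORRECTION = "partial_with_correction"


@dataclass
class Halt:
    """Hypothetical halt record; a stand-in, not the library's HaltEvent."""
    partial_text: str                                        # text produced before the halt
    conflicting_chunks: list = field(default_factory=list)   # top-K chunks that contradicted it


def apply_fallback(halt: Halt, strategy: Fallback) -> str:
    """Turn a halt into a user-facing response according to the chosen strategy."""
    if strategy is Fallback.SOFT_WARNING:
        # Keep the text, but flag it instead of blocking.
        return halt.partial_text + "\n[warning: this answer may conflict with the knowledge base]"
    if strategy is Fallback.RETRIEVAL_ONLY_RETRY:
        # Assumption: discard the generated text and answer from retrieved chunks only.
        lines = [f"- {chunk}" for chunk in halt.conflicting_chunks]
        return "From the knowledge base:\n" + "\n".join(lines)
    if strategy is Fallback.PARTIAL_WITH_CORRECTION:
        # Keep the partial answer and append the strongest conflicting chunk as a correction.
        if halt.conflicting_chunks:
            correction = halt.conflicting_chunks[0]
        else:
            correction = "no supporting source found"
        return f"{halt.partial_text}\n[correction: {correction}]"
    raise ValueError(f"unknown strategy: {strategy!r}")
```

The strategy is an explicit argument so the choice (warn, retry, or correct) stays a product decision made by the caller rather than something buried in the guardrail.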

Repo: https://github.com/anulum/director-ai Changelog: https://github.com/anulum/director-ai/releases/tag/v1.2.0

Would love feedback on the new bits — especially the end-to-end numbers and graceful patterns. Fire away!

1 comment

@soletta — you're right, and I've deferred this three times now, which isn't useful. Here's the honest answer: that specific benchmark — frontier LLM alone vs frontier + Director-AI on end-to-end hallucination rate in streaming — doesn't exist yet. I'll have it in the repo within the week, open methodology, raw logs included. But I'll also say clearly: if you're already running Claude Opus 4.6 or GPT-4o, Director-AI probably adds marginal value on self-consistency. Frontier models in 2025/2026 are remarkably coherent within a single response. Where it matters:

- You're running Llama-3.1-70B or a local vLLM stack and can't afford $15/M tokens for a judge call
- You need a hard stop with audit trail (regulatory, medical, legal), not a probabilistic nudge
- Your facts live in a private KB that can't go in a context window
- You need deterministic, reproducible decisions in prod
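To illustrate what "hard stop with audit trail" and "deterministic, reproducible decisions" can look like in practice, here is a minimal sketch. It is an assumption-laden illustration, not Director-AI's implementation: `AuditRecord`, `decide`, `to_log_line`, the fixed-threshold rule, and the 0.5 default are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry: enough to replay and verify one decision."""
    output_sha256: str   # hash of the checked text, so the log need not store the text itself
    score: float         # coherence score supplied by whatever scorer is in use
    threshold: float     # the fixed cut-off that was applied
    halted: bool         # the decision


def decide(output: str, score: float, threshold: float = 0.5) -> AuditRecord:
    """Halt when the score falls below a fixed threshold (hypothetical rule).

    No sampling and no model call: the same inputs always give the same record.
    """
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()
    return AuditRecord(digest, score, threshold, halted=score < threshold)


def to_log_line(record: AuditRecord) -> str:
    """Serialize with sorted keys so identical decisions produce identical log lines."""
    return json.dumps(asdict(record), sort_keys=True)
```

A judge-model call can give different verdicts on reruns; a fixed threshold over a logged score can be re-checked later from the log line alone, which is the property an auditor usually asks for.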

I'll run the frontier comparison this week and post results here regardless of how they look. Kind regards, Miroslav