
by anulum 110 days ago

Hey HN — huge thanks for the thoughtful comments yesterday!

I shipped *v1.2.0* overnight with everything you asked for:

• Full end-to-end benchmark notebook (600+ real RAG/agent traces, HaluEval + TruthfulQA, head-to-head vs Claude self-critique, latency, false positives, recovery rate) → notebooks/04_end_to_end_benchmark.ipynb

• Rich evidence on every halt: top-K conflicting chunks + highlighted NLI premise/hypothesis + distances (now in HaltEvent + dashboard)

• Ready-made graceful fallbacks (soft warning, retrieval-only retry, partial+correction) → examples/graceful_fallbacks.py

• Live Hugging Face Spaces demo (try it yourself): https://huggingface.co/spaces/anulum/director-ai-guardrail

• Full MkDocs site (22 pages), native OpenAI/Anthropic interceptors, score caching, 8-bit NLI, bge-large, LangGraph/Haystack/CrewAI support
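To make the three fallback patterns concrete, here is a minimal sketch in plain Python. It is not Director-AI's API: `Halt`, `Fallback`, `handle_halt`, and the `retry` callback are hypothetical names invented for illustration, and the real implementation lives in examples/graceful_fallbacks.py.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Fallback(Enum):
    SOFT_WARNING = "soft_warning"
    RETRIEVAL_ONLY_RETRY = "retrieval_only_retry"
    PARTIAL_PLUS_CORRECTION = "partial_plus_correction"


@dataclass
class Halt:
    # Hypothetical stand-in for a halt event; not the library's HaltEvent.
    partial_text: str              # text produced before the halt
    conflicting_chunks: List[str]  # evidence that contradicted the output


def handle_halt(
    halt: Halt,
    mode: Fallback,
    retry: Callable[[List[str]], str],
) -> str:
    """Turn a halt into a user-facing response instead of a bare error."""
    if mode is Fallback.SOFT_WARNING:
        # Keep the text, but flag it.
        return halt.partial_text + "\n[warning: may conflict with the knowledge base]"
    if mode is Fallback.RETRIEVAL_ONLY_RETRY:
        # Discard the generation and answer from retrieved evidence only.
        return retry(halt.conflicting_chunks)
    # PARTIAL_PLUS_CORRECTION: keep the partial text and append the evidence.
    correction = "; ".join(halt.conflicting_chunks)
    return f"{halt.partial_text}\n[correction from sources: {correction}]"
```

The caller picks the mode per use case: a chatbot might prefer the soft warning, while a regulated workflow would more likely retry from retrieval only.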

Repo: https://github.com/anulum/director-ai
Changelog: https://github.com/anulum/director-ai/releases/tag/v1.2.0

Would love feedback on the new bits — especially the end-to-end numbers and graceful patterns. Fire away!

1 comments

anulum 96 days ago

@soletta — you're right, and I've deferred this three times now, which isn't useful. Here's the honest answer: that specific benchmark — frontier LLM alone vs frontier + Director-AI on end-to-end hallucination rate in streaming — doesn't exist yet. I'll have it in the repo within the week, open methodology, raw logs included. But I'll also say clearly: if you're already running Claude Opus 4.6 or GPT-4o, Director-AI probably adds marginal value on self-consistency. Frontier models in 2025/2026 are remarkably coherent within a single response. Where it matters:

- You're running Llama-3.1-70B or a local vLLM stack and can't afford $15/M tokens for a judge call
- You need a hard stop with audit trail (regulatory, medical, legal), not a probabilistic nudge
- Your facts live in a private KB that can't go in a context window
- You need deterministic, reproducible decisions in prod
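As a rough illustration of "hard stop with audit trail" plus "deterministic decisions", here is a minimal sketch. It assumes a caller-supplied `score` function (a hypothetical placeholder for whatever coherence or NLI scorer is in use) and a fixed threshold. None of these names come from Director-AI itself.

```python
import json
from typing import Callable, Iterable, List, Tuple


def guarded_stream(
    chunks: Iterable[str],
    score: Callable[[str], float],  # hypothetical scorer: higher means more consistent
    threshold: float,
    audit: List[str],
) -> Tuple[str, bool]:
    """Accumulate streamed chunks; stop at the first one that drops the score
    below the threshold. Returns (accepted_text, halted)."""
    text = ""
    for step, chunk in enumerate(chunks):
        candidate = text + chunk
        s = score(candidate)
        halted = s < threshold
        # One JSON line per decision; sort_keys keeps the log byte-for-byte reproducible.
        audit.append(json.dumps(
            {"step": step, "score": s, "threshold": threshold, "halted": halted},
            sort_keys=True,
        ))
        if halted:
            return text, True  # hard stop: the offending chunk is never emitted
        text = candidate
    return text, False
```

With a deterministic scorer and a fixed threshold, the same input always produces the same decision and the same audit log, which is the property a judge call to a sampled LLM does not give you.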

I'll run the frontier comparison this week and post results here regardless of how they look. Kind regards, Miroslav

anulum 95 days ago

https://anulum.github.io/director-ai/benchmarks/