Hey HN, huge thanks for the thoughtful comments yesterday! I shipped *v1.2.0* overnight with everything you asked for:

• Full end-to-end benchmark notebook (600+ real RAG/agent traces, HaluEval + TruthfulQA, head-to-head vs Claude self-critique, latency, false positives, recovery rate) → notebooks/04_end_to_end_benchmark.ipynb
• Rich evidence on every halt: top-K conflicting chunks + highlighted NLI premise/hypothesis + distances (now in HaltEvent + dashboard)
• Ready-made graceful fallbacks (soft warning, retrieval-only retry, partial+correction) → examples/graceful_fallbacks.py
• Live Hugging Face Spaces demo (try it yourself): https://huggingface.co/spaces/anulum/director-ai-guardrail
• Full MkDocs site (22 pages), native OpenAI/Anthropic interceptors, score caching, 8-bit NLI, bge-large, LangGraph/Haystack/CrewAI support

Repo: https://github.com/anulum/director-ai
Changelog: https://github.com/anulum/director-ai/releases/tag/v1.2.0

Would love feedback on the new bits, especially the end-to-end numbers and graceful patterns. Fire away!
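For anyone wondering what the three fallback patterns could look like in practice, here is a rough sketch. It is not the code in examples/graceful_fallbacks.py and it does not use Director-AI's API: the `Halt` class and all three function names are hypothetical stand-ins, and the behaviour of each pattern is my reading of its name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Halt:
    """Hypothetical halt signal (NOT Director-AI's HaltEvent)."""
    partial_text: str                      # text generated before the stop
    score: float                           # assumed coherence score in [0, 1]
    evidence: List[str] = field(default_factory=list)  # conflicting chunks


def soft_warning(halt: Halt) -> str:
    # Pattern 1: keep the output but attach a visible caveat.
    return (f"{halt.partial_text}\n\n"
            f"[Warning: low coherence score {halt.score:.2f}; verify before use.]")


def retrieval_only_retry(halt: Halt) -> str:
    # Pattern 2: drop the generated text and return only retrieved passages.
    if not halt.evidence:
        return "No supporting passages found."
    lines = "\n".join(f"- {chunk}" for chunk in halt.evidence)
    return "Answer withheld. Relevant passages:\n" + lines


def partial_plus_correction(halt: Halt, correction: str) -> str:
    # Pattern 3: show what was generated, mark the stop, append a correction.
    return f"{halt.partial_text} [halted]\nCorrection: {correction}"


if __name__ == "__main__":
    halt = Halt(
        partial_text="The capital of Australia is Sydney",
        score=0.31,
        evidence=["Canberra is the capital of Australia."],
    )
    print(soft_warning(halt))
    print(retrieval_only_retry(halt))
    print(partial_plus_correction(halt, "The capital of Australia is Canberra."))
```

Which pattern fits depends on how costly a wrong answer is: a soft warning keeps the user moving, while the retrieval-only variant never shows unverified text.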
- You're running Llama-3.1-70B or a local vLLM stack and can't afford $15/M tokens for a judge call
- You need a hard stop with audit trail (regulatory, medical, legal), not a probabilistic nudge
- Your facts live in a private KB that can't go in a context window
- You need deterministic, reproducible decisions in prod
I'll run the frontier comparison this week and post results here regardless of how they look.

Kind regards,
Miroslav