| HN Mirror


by anulum 102 days ago

@soletta, you're right, and I've deferred this three times now, which isn't useful. Here's the honest answer: that specific benchmark (frontier LLM alone vs. frontier LLM plus Director-AI, measured on end-to-end hallucination rate in streaming) doesn't exist yet. I'll have it in the repo within the week, with open methodology and raw logs included. But I'll also say clearly: if you're already running Claude Opus 4.6 or GPT-4o, Director-AI probably adds only marginal value on self-consistency. Frontier models in 2025/2026 are remarkably coherent within a single response. Where it matters:

- You're running Llama-3.1-70B or a local vLLM stack and can't afford $15/M tokens for a judge call
- You need a hard stop with audit trail (regulatory, medical, legal), not a probabilistic nudge
- Your facts live in a private KB that can't go in a context window
- You need deterministic, reproducible decisions in prod

I'll run the frontier comparison this week and post results here regardless of how they look. Kind regards, Miroslav


anulum 101 days ago

https://anulum.github.io/director-ai/benchmarks/
