|
|
|
|
|
by anulum
102 days ago
|
|
@soletta — you're right, and I've deferred this three times now, which isn't useful.
Here's the honest answer: that specific benchmark — frontier LLM alone vs frontier + Director-AI on end-to-end hallucination rate in streaming — doesn't exist yet. I'll have it in the repo within the week, open methodology, raw logs included.
But I'll also say clearly: if you're already running Claude Opus 4.6 or GPT-4o, Director-AI probably adds marginal value on self-consistency. Frontier models in 2025/2026 are remarkably coherent within a single response.
Where it matters: - You're running Llama-3.1-70B or a local vLLM stack and can't afford $15/M tokens for a judge call
- You need a hard stop with audit trail (regulatory, medical, legal) — not a probabilistic nudge - Your facts live in a private KB that can't go in a context window
- You need deterministic, reproducible decisions in prod I'll run the frontier comparison this week and post results here regardless of how they look. Kind regard Miroslav |
|