Hacker News
by scrollop 43 days ago
Obscene levels of hallucinations, the worst of LLMs, unfortunately.

Deepseek v4 pro - 94%

Deepseek v4 flash - 96%

https://artificialanalysis.ai/evaluations/omniscience?models...

3 comments

Personally, I'm not bothered very much by LLM confabulation, as long as it's the result of missing context. In most practical tasks, we either give the model the context or tell it to find that context itself on the internet. What I am concerned with is confabulation that contradicts available in-context information, but that doesn't seem to be what is measured here.
This must be easily benchmaxed, because I have never gotten an "idk"-like answer from the western frontier models. In all my personal "real world" use cases, they always resort to hallucinations.
The output of any LLM is always 100% hallucination in principle. On top of that, most benchmarks are at best an approximation of LLM quality. Your use case decides which model to use. That said, I haven't tested v4 yet, but the old 3.2 is still a decent model. And concerning use cases, I've had coding problems that Opus couldn't solve but a local 35B model could.

All the talk about frontier and SOTA is there to dig deeper and deeper into the pockets of VCs and finally do an IPO.