by wongarsu
2 hours ago
It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5, or Fable. I really like that benchmark because it's one of the few that lets LLMs decline to answer when they are unsure and penalizes them for trying to bullshit their way through.
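
To make the "abstaining beats bluffing" point concrete, here is a rough sketch of how such scoring could work. The formulas are my assumption, not the benchmark's published definition: I'm treating the non-hallucination rate as one minus the share of wrong answers among all responses that weren't correct, and adding a simple correct-minus-incorrect score. The function names and outcome labels are made up for illustration.

```python
from collections import Counter


def non_hallucination_rate(outcomes):
    """Assumed definition: 1 - incorrect / (incorrect + abstain).

    `outcomes` is an iterable of the hypothetical labels
    'correct', 'incorrect', or 'abstain'.
    """
    counts = Counter(outcomes)
    not_correct = counts["incorrect"] + counts["abstain"]
    if not_correct == 0:
        # Every answer was correct, so there was nothing to hallucinate on.
        return 1.0
    return 1.0 - counts["incorrect"] / not_correct


def penalized_score(outcomes):
    """Assumed scoring: +1 per correct, -1 per incorrect, 0 per abstention."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["correct"] - counts["incorrect"]) / total


# A model that always answers vs. one that abstains when unsure.
guesser = ["correct"] * 6 + ["incorrect"] * 4
cautious = ["correct"] * 5 + ["incorrect"] * 1 + ["abstain"] * 4

print(non_hallucination_rate(guesser), penalized_score(guesser))
print(non_hallucination_rate(cautious), penalized_score(cautious))
```

Under these assumptions the cautious model comes out ahead on both numbers even though it gets fewer questions right, which is the behavior a plain accuracy benchmark can't reward.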
it said "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.