| HN Mirror


by theptip 22 days ago

Another (IMO fatal) error is that they don’t attempt to measure within-model variance.

The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism.
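To make "within-model variance" concrete, here is a minimal sketch of what I'd want reported. It assumes you ask the same model the same claim several times and record each verdict. The claim IDs, the run count of 5, and the verdicts are made-up illustration data, not numbers from the study.

```python
from collections import Counter

def modal_agreement(verdicts):
    """Fraction of repeated runs that match the most common verdict for one claim."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][1] / len(verdicts)

def within_model_flip_rate(runs_by_claim):
    """Share of claims where the same model gave more than one distinct verdict."""
    unstable = sum(1 for v in runs_by_claim.values() if len(set(v)) > 1)
    return unstable / len(runs_by_claim)

# Hypothetical data: one model, each claim asked 5 times.
runs = {
    "claim-1": ["True", "True", "True", "True", "True"],
    "claim-2": ["True", "False", "True", "Mostly True", "True"],
    "claim-3": ["False", "False", "Misleading", "False", "False"],
    "claim-4": ["False", "False", "False", "False", "False"],
}

flip_rate = within_model_flip_rate(runs)
agreement = {claim: modal_agreement(v) for claim, v in runs.items()}
print(f"flip rate: {flip_rate:.2f}")
for claim, score in agreement.items():
    print(f"{claim}: modal agreement {score:.2f}")
```

If the flip rate for a single model is anywhere near the between-model disagreement rate, the between-model number is mostly noise.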

They really should be breaking out the numbers for the three models without search (kinda meaningless for recent factual claims after the knowledge cutoff) vs. the search agents. The lack of an “I don’t know” option completely invalidates the results for the non-search models; they are basically guessing what seems like a probable answer, since they don’t know and aren’t allowed to say so.

I do agree the forced choice and “weak / strong” variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL examples to illustrate what you mean by “mostly” instead of leaving this to the model to define.

1 comment

kostaj 22 days ago

Good idea about publishing intra-model variance data! Will include it in the next version. Even if we set aside the two middle buckets (Mostly True and Misleading), which are somewhat subject to interpretation and hedging, at least two models still give polar-opposite verdicts on 21% of the claims (one model saying True and another saying False).
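For clarity on how a polar-opposite rate like this can be counted, here is a rough sketch. The model names, claim IDs, and verdicts are hypothetical placeholders, not our actual data or pipeline.

```python
def polar_opposite_rate(verdicts_by_claim):
    """Share of claims where at least one model says True and another says False.

    Middle buckets (Mostly True, Misleading) are ignored on purpose.
    """
    polar = 0
    for per_model in verdicts_by_claim.values():
        verdicts = set(per_model.values())
        if "True" in verdicts and "False" in verdicts:
            polar += 1
    return polar / len(verdicts_by_claim)

# Hypothetical data: claim -> {model name -> verdict}
verdicts_by_claim = {
    "claim-1": {"model_a": "True", "model_b": "False", "model_c": "True"},
    "claim-2": {"model_a": "True", "model_b": "Mostly True", "model_c": "True"},
    "claim-3": {"model_a": "Misleading", "model_b": "False", "model_c": "False"},
    "claim-4": {"model_a": "False", "model_b": "Misleading", "model_c": "True"},
    "claim-5": {"model_a": "True", "model_b": "True", "model_c": "True"},
}

rate = polar_opposite_rate(verdicts_by_claim)
print(f"polar-opposite rate: {rate:.0%}")
```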


vlovich123 22 days ago

Of those 21%, how many are time-dependent questions that fall after the model’s training cutoff and require research to verify? Like the “did Ukraine attack Russia in the past week” question?
