by theptip
22 days ago
Another (IMO fatal) error is that they don’t attempt to measure within-model variance. The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism. They really should be breaking out the numbers for the 3 models without search (kinda meaningless for recent factual claims after the knowledge cutoff) vs. the search agents. The lack of an “I don’t know” option completely invalidates the results for the non-search models; they are basically guessing what seems like a probable answer, since they don’t know and aren’t allowed to say so. I do agree that the forced choice and the “weak / strong” variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL (in-context learning) examples to illustrate what you mean by “mostly” instead of leaving this to the model to define.
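For what it's worth, here is a rough sketch of the kind of breakdown I mean, assuming you repeat the whole eval several times per model and log each answer as correct, wrong, or abstain. The record layout, the `summarize` helper, and the example rows are all hypothetical and made up for illustration; none of this comes from the study.

```python
from statistics import mean, stdev

# Hypothetical record layout (an assumption, not the study's format):
# (model_name, uses_search, run_id, outcome)
# where outcome is one of "correct", "wrong", "abstain".
example_runs = [
    ("agent_a", True, 0, "correct"),
    ("agent_a", True, 0, "correct"),
    ("agent_a", True, 1, "correct"),
    ("agent_a", True, 1, "wrong"),
    ("model_b", False, 0, "abstain"),
    ("model_b", False, 0, "wrong"),
]


def summarize(runs):
    """Report per-model accuracy across repeated runs, split by search vs no search."""
    # (model, uses_search) -> {run_id: [outcomes]}
    grouped = {}
    for model, uses_search, run_id, outcome in runs:
        grouped.setdefault((model, uses_search), {}).setdefault(run_id, []).append(outcome)

    summary = {}
    for key, by_run in grouped.items():
        # One accuracy and one abstention rate per repeated run.
        accs = [o.count("correct") / len(o) for o in by_run.values()]
        abstains = [o.count("abstain") / len(o) for o in by_run.values()]
        summary[key] = {
            "n_runs": len(accs),
            "mean_acc": mean(accs),
            # Sample standard deviation across runs: the within-model variance.
            # Undefined for a single run, so report 0.0 in that case.
            "sd_acc": stdev(accs) if len(accs) > 1 else 0.0,
            "mean_abstain": mean(abstains),
        }
    return summary


if __name__ == "__main__":
    for (model, uses_search), stats in summarize(example_runs).items():
        print(model, "search" if uses_search else "no-search", stats)
```

If the spread across runs for one model is about the same size as the gap between two models, the ranking isn't telling you much. Tracking abstentions separately also only works if the prompt actually lets the model say “I don’t know”.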