by gwern
6 hours ago
> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

All of this points to their claim of 'average' as being heavily biased downwards. A model being so up to date and large-parameter it's memorized solutions to your problems is not a knock against it (but rather, a knock against your benchmark being valid), and why should timeouts (especially for a model just launched) be counted at all?