| HN Mirror


	by gwern 6 hours ago
	> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

	All of this suggests that their 'average' verdict is heavily biased downward. A model being so up to date and so large that it has memorized solutions to your problems is not a knock against it (rather, it is a knock against your benchmark's validity), and why should timeouts (especially for a model just launched) be counted at all?

3 comments

sigmar 4 hours ago

Agree with this. It seems strange to me to frame "training recall" (33 of the 38 cheating instances) as cheating. Most people think of "cheating" as breaking rules. How is an LLM supposed to not use what was put into its weights?

notnullorvoid 1 hour ago

While I probably wouldn't classify it as cheating, it is an even bigger signal of concern for model quality.

Cheating by breaking the rules at least implies some learned patterns.

Repeating training data verbatim for narrow cases like this implies that the model is overfitting.

anematode 4 hours ago

By writing a not-identical, but valid, solution? Any modestly complex engineering problem has many solutions.

This is an obvious example of why LLM training is so different from human learning.

torginus 2 hours ago

I mean people expect a model to give a working solution. They also expect it to provide it in as few tokens as possible (input/output). They might expect it to come up with an original solution, but I don't think most people would compromise on the first two points.

simoncion 2 hours ago

I expect any well-informed corporate lawyer who has thought about this carefully is strongly advising that these tools not be used. When the LLM [0] barfs up some nontrivial code that's covered by the AGPL and your company's devs, entirely unaware of its provenance, put it into the company's "all rights reserved" codebase, it's going to be a nightmare to come back from that.

[0] ...that Nvidia's CEO says they should be spending 50% of a senior dev's salary per seat per year on...

senordevnyc 51 minutes ago

The ship sailed on this a long time ago.

anematode 6 hours ago

> memorization of upstream fixes from training data

At least now we have up-to-date evidence on their laundering, and the fact that regurgitation absolutely still happens.

Aurornis 6 hours ago

I agree. This article could have been an interesting read about how coding benchmarks are hard and a constantly moving target, but instead they anchored to a belief that their benchmark is correct.

I can't shake the feeling that they knew which headline would generate the most shares and wrote the article to fit instead of acknowledging where they went wrong.
