|
|
|
|
|
by simonw
314 days ago
|
|
I see that criticism a lot - that benchmarks like space invaders don't make sense because they're inevitably in the training data - and I don't buy that at all. Firstly, 12GB is not enough space to hold a copy of anything that large from the training data and just regurgitate it back out again. You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60... Illustrative extract: > Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width. > When direction changes, move all aliens down by step (e.g., 10 px). The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge). |
|
Add to this that the common crawl slices used for oile/C4 mirror much of what you can find on github. So when the training data contains dozens of near duplicate solutions, the network only needs to interpolate between them.
As to the COT style dumps that you shown, they are easy to misinterpret. Apple’s illusion of thinking paper shows that models will happily backfill plausible sounding rationales that do not correspond to the gradients that actually produced the answer and other evaluation work shows that when you systematically rewrite multiple choice distractors so that memorisation can’t help, accuracy drops by 50-90%, even on "reasoning" models https://arxiv.org/abs/2502.12896 So a cool looking bullet list about "edge detection" could be just narrative overspray, so not really an evidence of algorithmic planning.
If you actually want to know whether a model can plan an arcade game or whatever rather than recall it then you need a real benchmark (metamorphic rewrites, adversarial “none of the others” options etc). Until a benchmark controls for leakage in these ways, a perfect space invaders score mostly shows that the model has good pattern matching for code it has already seen.