by markasoftware 313 days ago
Still, if you ask this open model to generate a fancy space invaders game with polish, and then ask the other model to generate a bare-bones space invaders game with the fewest lines of code, I think there's a good chance they'd switch places. This doesn't really test the models' ability to generate a space invaders game so much as it tests their tendency to make an elaborate vs simple solution.

My main goal with that benchmark is to see if it can produce HTML and JavaScript code that runs without errors for a moderately complex challenge.

It's not a comprehensive benchmark - there are many ways you could run it that would be much more informative and robust.

It's great as a quick single sentence prompt to get a feeling for if the model can produce working JavaScript or not.

Not really. The other commenters are correct, I feel, and this is not really proving anything about the fundamental capability of the model. It’s just a hello world benchmark adding no real value, just driving blog traffic for you.
The space invaders benchmark proves that the model can implement a working HTML and JavaScript game from a single prompt. That's a pretty fundamental capability for a model.

Comparing them between models is also kind of interesting, even if it's not a flawlessly robust comparison: https://simonwillison.net/tags/space-invaders/

Implement or retrieve? That’s an important distinction. When evaluating models, you run a variety of tests, and the benchmarks that aren’t publicly disclosed are the most reliable. Your Space Invaders game isn’t really a benchmark of anything: just Google it and you’ll find plenty of implementations.
I see that criticism a lot - that benchmarks like space invaders don't make sense because they're inevitably in the training data - and I don't buy that at all.

Firstly, 12GB is not enough space to hold a copy of anything that large from the training data and just regurgitate it back out again.

You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...

Illustrative extract:

> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.

> When direction changes, move all aliens down by step (e.g., 10 px).
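For readers who want to see what those two quoted steps amount to, here is a minimal sketch. It is written in Python for brevity rather than the JavaScript the model produces, and every name and default in it (`Alien`, `step_aliens`, `alien_width`, `speed`, `drop`) is a hypothetical chosen for illustration, not taken from the linked gist.

```python
from dataclasses import dataclass


@dataclass
class Alien:
    x: float  # left edge of the alien, in pixels
    y: float  # top edge of the alien, in pixels


def step_aliens(aliens, direction, canvas_width,
                alien_width=30, speed=5, drop=10):
    """Advance the alien grid by one tick.

    direction is +1 (moving right) or -1 (moving left).
    Returns the direction to use on the next tick.
    All default values are illustrative assumptions.
    """
    if not aliens:
        return direction

    leftmost = min(a.x for a in aliens)
    rightmost = max(a.x for a in aliens)

    # Edge detection: would the next horizontal move leave the canvas?
    hits_right = direction > 0 and rightmost + speed > canvas_width - alien_width
    hits_left = direction < 0 and leftmost - speed < 0

    if hits_left or hits_right:
        # Direction change: move every alien down by one step and reverse.
        for a in aliens:
            a.y += drop
        return -direction

    # Otherwise keep marching sideways.
    for a in aliens:
        a.x += direction * speed
    return direction
```

Calling `step_aliens` once per frame and feeding the returned direction back in gives the familiar zig-zag descent.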

The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).

12.8 GB is around 110 Gbits. Even at 4.25 bits/weight the network stores ~26 billion "micro weights". A 1.4k-token space invaders snippet occupies ~1.1 kb compressed, so the model could parametrize thousands of such snippets and still have more than 99% of its capacity left. If you would like to know more, this paper about LLM memorization is interesting: https://arxiv.org/abs/2312.11658 and another recent paper, SWE-Bench Illusion, shows SOTA code LLM results collapsing once memorised GitHub issues are filtered out: https://arxiv.org/pdf/2506.12286v1
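The arithmetic in that comment can be checked with a short sketch. Two assumptions are baked in: "GB" is read as GiB (2^30 bytes), which is what makes 12.8 GB come out near 110 Gbits, and "1.1 kb" is read as 1.1 kilobytes. The snippet count of 5,000 is a made-up stand-in for "thousands". This only compares raw byte counts; it says nothing about how much text a network can actually memorize.

```python
GIB = 2**30  # assumption: "GB" in the comment means GiB

model_bytes = 12.8 * GIB
model_bits = model_bytes * 8
bits_per_weight = 4.25

# Number of quantized weights that fit in the file at that bit width.
weights = model_bits / bits_per_weight

# Assumption: "1.1 kb" means 1.1 kilobytes per compressed snippet.
snippet_bytes = 1.1 * 1000
n_snippets = 5000  # hypothetical stand-in for "thousands"

# Share of the file size that this many snippets would occupy.
fraction_used = n_snippets * snippet_bytes / model_bytes

print(f"model size: {model_bits / 1e9:.1f} Gbit")
print(f"weights:    {weights / 1e9:.1f} billion")
print(f"snippets:   {fraction_used:.4%} of the file size")
```

Under those assumptions the snippets take up well under 1% of the file, which matches the comment's "more than 99% left" in raw storage terms.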

Add to this that the Common Crawl slices used for The Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.

As to the CoT-style dumps that you have shown, they are easy to misinterpret. Apple’s "Illusion of Thinking" paper shows that models will happily backfill plausible-sounding rationales that do not correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can’t help, accuracy drops by 50-90%, even on "reasoning" models (https://arxiv.org/abs/2502.12896). So a cool-looking bullet list about "edge detection" could be just narrative overspray, not really evidence of algorithmic planning.

If you actually want to know whether a model can plan an arcade game (or whatever) rather than recall it, then you need a real benchmark (metamorphic rewrites, adversarial “none of the others” options, etc). Until a benchmark controls for leakage in these ways, a perfect space invaders score mostly shows that the model has good pattern matching for code it has already seen.
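As a rough illustration of what a "none of the others" rewrite could look like for a multiple-choice item, here is a minimal sketch. It is not the procedure from any of the papers linked above; the function name `rewrite_item`, its parameters, and the toy question are all hypothetical.

```python
import random


def rewrite_item(choices, answer_idx, rng, drop_answer=False):
    """Return (new_choices, new_answer_idx) for one multiple-choice item.

    The options are shuffled and a "None of the others" option is appended.
    With drop_answer=True the original correct option is removed, so the
    appended option becomes the right answer and a memorised answer string
    no longer appears anywhere in the list.
    """
    correct = choices[answer_idx]
    # Copy by index so duplicate option texts are handled correctly.
    options = [c for i, c in enumerate(choices)
               if not (drop_answer and i == answer_idx)]
    rng.shuffle(options)
    options.append("None of the others")

    if drop_answer:
        new_answer = len(options) - 1
    else:
        new_answer = options.index(correct)
    return options, new_answer


# Toy usage with a seeded generator so the rewrite is reproducible.
rng = random.Random(0)
toy_choices = ["4", "5", "6", "7"]  # hypothetical item: "What is 2 + 3?"
kept, kept_idx = rewrite_item(toy_choices, 1, rng)
dropped, dropped_idx = rewrite_item(toy_choices, 1, rng, drop_answer=True)
```

Scoring a model on both the original and the rewritten items gives a crude signal of how much it depends on having seen the exact answer string before.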