Hacker News
by simonw 313 days ago
My main goal with that benchmark is to see if it can produce HTML and JavaScript code that runs without errors for a moderately complex challenge.

It's not a comprehensive benchmark - there are many ways you could run it that would be much more informative and robust.

It's great as a quick single-sentence prompt to get a feel for whether the model can produce working JavaScript or not.


Not really. I feel the other commenters are correct: this doesn't prove anything about the fundamental capability of the model. It’s just a hello-world benchmark that adds no real value and only drives blog traffic for you.
The space invaders benchmark proves that the model can implement a working HTML and JavaScript game from a single prompt. That's a pretty fundamental capability for a model.

Comparing them between models is also kind of interesting, even if it's not a flawlessly robust comparison: https://simonwillison.net/tags/space-invaders/

Implement or retrieve? That’s an important distinction. When evaluating models, you run a variety of tests, and the benchmarks that aren’t publicly disclosed are the most reliable. Your Space Invaders game isn’t really a benchmark of anything: just Google it and you’ll find plenty of implementations.
I see that criticism a lot - that benchmarks like space invaders don't make sense because they're inevitably in the training data - and I don't buy that at all.

Firstly, 12GB is not enough space to hold a copy of anything that large from the training data and just regurgitate it back out again.

You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...

Illustrative extract:

> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.

> When direction changes, move all aliens down by step (e.g., 10 px).
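
For readers who have not written this kind of game loop, here is a minimal Python sketch of what those two quoted lines describe. It is not the model's output (that was JavaScript, linked above); the `Alien` class, the `step_aliens` name, and the default sizes and speeds are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alien:
    x: float
    y: float


def step_aliens(aliens, direction, canvas_width, alien_width=30, speed=5, drop=10):
    """Advance the alien grid by one tick and return the direction to use next.

    direction is +1 (moving right) or -1 (moving left). All sizes are in
    pixels and are illustrative defaults, not values from the linked gist.
    """
    if not aliens:
        return direction

    leftmost = min(a.x for a in aliens)
    rightmost = max(a.x for a in aliens)

    # Edge detection: would the next horizontal move leave the canvas?
    # The right-hand limit is canvas width minus alien width, as in the trace.
    hits_right = direction > 0 and rightmost + speed > canvas_width - alien_width
    hits_left = direction < 0 and leftmost - speed < 0

    if hits_right or hits_left:
        # When direction changes, move all aliens down by one step.
        for alien in aliens:
            alien.y += drop
        return -direction

    for alien in aliens:
        alien.x += direction * speed
    return direction
```

A game loop would call `step_aliens` once per tick and feed the returned direction back in. Getting the two edge conditions wrong is an easy way to end up with a grid that walks straight off the canvas.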

The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).

12.8 GB is around 110 Gbits. Even at 4.25 bits/weight, the network stores ~26 billion "micro weights". A 1.4k-token space invaders snippet occupies ~1.1 KB compressed, so the model could parametrize thousands of such snippets and still have more than 99% of its capacity left. This paper about LLM memorization is interesting, if you would like to know more: https://arxiv.org/abs/2312.11658 and another recent, interesting paper, the SWE-Bench Illusion, shows SOTA code LLM results collapsing once memorised GitHub issues are filtered out: https://arxiv.org/pdf/2506.12286v1
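
The arithmetic in that comment can be reproduced in a few lines. This is only a sketch of the commenter's own figures: reading "GB" as GiB (which is what makes 110 Gbits come out), taking "thousands" to mean 5,000, and treating 1.1 KB as 1,100 bytes are all assumptions, and the calculation says nothing about whether weights can actually store text at that density.

```python
GIB = 2**30  # assumption: "12.8 GB" read as GiB; with decimal GB the total is 102.4 Gbit

model_bytes = 12.8 * GIB
model_bits = model_bytes * 8          # roughly 110e9 bits
weights = model_bits / 4.25           # roughly 26e9 weights at 4.25 bits/weight

snippet_bytes = 1.1 * 1000            # the comment's ~1.1 KB compressed snippet
n_snippets = 5000                     # assumption: one reading of "thousands"
fraction_used = n_snippets * snippet_bytes / model_bytes  # well under 1%

print(f"{model_bits / 1e9:.1f} Gbit, {weights / 1e9:.1f} billion weights")
print(f"{fraction_used:.4%} of the file size used by {n_snippets} snippets")
```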

Add to this that the Common Crawl slices used for the Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.

As for the CoT-style dumps that you showed, they are easy to misinterpret. Apple’s "Illusion of Thinking" paper shows that models will happily backfill plausible-sounding rationales that do not correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can’t help, accuracy drops by 50-90%, even on "reasoning" models: https://arxiv.org/abs/2502.12896 So a cool-looking bullet list about "edge detection" could be just narrative overspray, and not really evidence of algorithmic planning.

If you actually want to know whether a model can plan an arcade game (or whatever) rather than recall it, then you need a real benchmark (metamorphic rewrites, adversarial “none of the others” options, etc.). Until a benchmark controls for leakage in these ways, a perfect space invaders score mostly shows that the model has good pattern matching for code it has already seen.
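
As a rough illustration of the “none of the others” idea mentioned above, here is a minimal Python sketch. It is one possible reading of the technique, not the procedure from the linked paper: the function name, the exact option wording, and the shuffling are assumptions.

```python
import random

NONE_OPTION = "None of the others"


def add_none_of_the_others(question, choices, answer_index, rng):
    """Return (question, new_choices, new_answer_index) for a rewritten item.

    The original correct choice is removed and replaced with NONE_OPTION,
    which becomes the right answer. A model that only memorised the original
    answer string can no longer pick it; it has to rule out each distractor.
    """
    distractors = [c for i, c in enumerate(choices) if i != answer_index]
    new_choices = distractors + [NONE_OPTION]
    rng.shuffle(new_choices)  # avoid the new answer always sitting in the last slot
    return question, new_choices, new_choices.index(NONE_OPTION)


# Usage with placeholder choices (hypothetical, not real benchmark data):
rng = random.Random(0)
q, options, correct = add_none_of_the_others(
    "Placeholder question?", ["A1", "A2", "A3", "A4"], 2, rng
)
```

Comparing accuracy on the original and rewritten items is the point of the exercise; the rewrite on its own proves nothing.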

If the models are memorizing and regurgitating from their training data, how come every model I've tried this with produces entirely different code?

Presumably this is because "the network only needs to interpolate between them". That's what I want it to do!
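
One rough way to put a number on "entirely different code" is a pairwise text-similarity check over the outputs. This is a sketch using Python's standard `difflib`; the `pairwise_similarity` helper and the model names are made up for illustration, and a low textual ratio does not rule out two programs sharing the same structure.

```python
import difflib
from itertools import combinations


def pairwise_similarity(outputs):
    """Map each pair of model names to a 0..1 textual similarity ratio.

    outputs is a dict of {model_name: generated_source_code}. The ratio comes
    from difflib.SequenceMatcher and only measures surface-level overlap.
    """
    return {
        (a, b): difflib.SequenceMatcher(None, outputs[a], outputs[b]).ratio()
        for a, b in combinations(sorted(outputs), 2)
    }


# Usage with tiny placeholder strings standing in for full game source:
sims = pairwise_similarity({
    "model-a": "let x = 1;",
    "model-b": "let x = 1;",
    "model-c": "const aliens = [];",
})
```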

I tried the space invaders thing on a 4GB Qwen model today and it managed to produce a grid of aliens that advanced one step... and then dropped off the page entirely.