by markasoftware 313 days ago
The Space Invaders game seems like a poor benchmark. Both models understood the prompt and generated valid, functional JavaScript. One just added fancier graphics. It might just have "use fancy graphics" in its system prompt for all we know.

The way I run these prompts excludes a system prompt - I'm hitting the models directly.
Still, if you ask this open model to generate a fancy Space Invaders game with polish, and then ask the other model to generate a bare-bones Space Invaders game with the fewest lines of code, I think there's a good chance they'd switch places. This doesn't really test the model's ability to generate a Space Invaders game so much as it tests its tendency to produce an elaborate versus a simple solution.
My main goal with that benchmark is to see if it can produce HTML and JavaScript code that runs without errors for a moderately complex challenge.

It's not a comprehensive benchmark - there are many ways you could run it that would be much more informative and robust.

It's great as a quick single-sentence prompt to get a feel for whether the model can produce working JavaScript.

Not really. I feel the other commenters are correct: this is not proving anything about the fundamental capability of the model. It’s just a hello-world benchmark that adds no real value and only drives blog traffic for you.
The space invaders benchmark proves that the model can implement a working HTML and JavaScript game from a single prompt. That's a pretty fundamental capability for a model.

Comparing them between models is also kind of interesting, even if it's not a flawlessly robust comparison: https://simonwillison.net/tags/space-invaders/

Implement or retrieve? That’s an important distinction. When evaluating models, you run a variety of tests, and the benchmarks that aren’t publicly disclosed are the most reliable. Your Space Invaders game isn’t really a benchmark of anything: just Google it and you’ll find plenty of implementations.