by lossolo
313 days ago
Implement or retrieve? That’s an important distinction. When evaluating models, you run a variety of tests, and the benchmarks that aren’t publicly disclosed are the most reliable. Your Space Invaders game isn’t really a benchmark of anything: just Google it and you’ll find plenty of implementations.
Firstly, 12GB is not enough space to hold a copy of anything that large from the training data and just regurgitate it back out again.
You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...
Illustrative extract:
> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.
> When direction changes, move all aliens down by step (e.g., 10 px).
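To make that extract concrete, here is a rough Python sketch of the logic it describes. This is not the model's output: the `Alien` class, the `step_aliens` function, and the `speed` and `drop` defaults are all names and values I made up for illustration, and it assumes `x` is the left edge of each alien.

```python
from dataclasses import dataclass


@dataclass
class Alien:
    x: float  # left edge of the alien, in canvas pixels (assumption)
    y: float


def step_aliens(aliens, direction, canvas_width, alien_width,
                speed=2.0, drop=10.0):
    """Advance the formation one tick and return the new direction.

    direction is +1 (moving right) or -1 (moving left). The speed and
    drop values are illustrative, not taken from the trace.
    """
    if not aliens:
        return direction

    leftmost = min(a.x for a in aliens)
    rightmost = max(a.x for a in aliens)

    # Edge detection: the rightmost alien may not pass
    # canvas_width - alien_width, and the leftmost may not pass 0.
    hit_right = direction > 0 and rightmost + speed > canvas_width - alien_width
    hit_left = direction < 0 and leftmost - speed < 0

    if hit_left or hit_right:
        # Direction change: move every alien down by one step.
        for a in aliens:
            a.y += drop
        return -direction

    # Otherwise keep marching sideways.
    for a in aliens:
        a.x += direction * speed
    return direction
```

Calling `step_aliens` once per frame and feeding the returned direction back in gives the familiar side-to-side march with a drop at each wall.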
The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (a real example from MMLU, which then provides four choices as a multiple-choice challenge).