|
|
|
|
|
by andrepd
177 days ago
|
|
There was a well-publicised "Claude plays Pokémon" stream where Claude failed to complete Pokemon Blue in spectacular fashion, despite weeks of trying. I think only a very gullible person would assume that future LLMs didn't specifically bake this into their training, as they do for popular benchmarks or for penguins riding a bike. |
|
Just try other random, non-realistic things like “a giraffe walking a tightrope”, “a car sitting at a cafe eating a pizza”, etc.
If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn’t.