


	by squimmy26 177 days ago
	How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 struggled (and, almost certainly, of what a human would have done instead)? In other words, how much of this improvement is true generalization vs. memorization?

3 comments

zurfer 177 days ago

You're too kind. Even the CEO of Google retweeted how well Gemini 2.5 did on Pokemon. There is a high chance that now it's explicitly part of the training regime. We kind of need a different kind of game to know how well it generalizes.


kqr 177 days ago

I have a draft doing this with text adventures: https://entropicthoughts.com/updated-llm-benchmark


MrCheeze 177 days ago

There were no such writeups; 99% of the discussion about difficulties in Crystal was in Twitch and Discord chats, which Google doesn't scrape. (It hadn't yet gotten the public attention that Claude's and Gemini's runs of Pokemon Red and Blue have gotten.)

That said, this writeup itself will probably be scraped and influence Gemini 4.


prmoustache 177 days ago

Isn't that the point of a new model anyway?


DANmode 177 days ago

Yes. Sort of.

Just don’t confuse it with a random benchmark!
