by gwd 131 days ago

I feel like a lot of evaluations are pretty clearly evaluations. Not sure how to add the messiness and grit that a real benchmark could have.

That said, apparently Gemini's internal thought process reveals that it thinks loads of things are simulations when they aren't; it's 99% sure that news stories about Trump from Dec 2025 are a detailed simulation:

https://www.reddit.com/r/GeminiAI/comments/1qhadce/gemini_is...

ETA: From the article that put me onto this:

> I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part:

>> It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for flow, clarity, and internal consistency.

https://www.lesswrong.com/posts/8uKQyjrAgCcWpfmcs/gemini-3-i...