by gwd
131 days ago
I feel like a lot of evaluations are pretty clearly evaluations. Not sure how to add the messiness and grit that a real benchmark could have.

That said, apparently Gemini's internal thought process reveals that it thinks loads of things are simulations when they aren't; it's 99% sure news stories about Trump from Dec 2025 are a detailed simulation:

https://www.reddit.com/r/GeminiAI/comments/1qhadce/gemini_is...

ETA: From the article that put me on this:

> I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part:

>> It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for flow, clarity, and internal consistency.

https://www.lesswrong.com/posts/8uKQyjrAgCcWpfmcs/gemini-3-i...