by throw83288 493 days ago
Apparently OpenAI's Deep Research already saturated a quarter of this benchmark, more or less a month in. But I also imagine it makes baffling mistakes anyway.

"Humanity's Laster Exam" coming up when?