Hacker News
by miraculixx 358 days ago
As any AI researcher knows, if you have a model that does 4x better than the naive baseline (the humans, in this case), you are likely looking at overfitting, not real-life performance. This study is just slop, and you can tell by the mere fact that they did not submit a paper, but just published a PR article.
2 comments

They didn't? What am I looking at, then?

https://arxiv.org/abs/2506.22405

This appears when you click on 'View Publication' in the article near the end, right before Q&A.

In the paper, they say they used the most recent 56 cases (from 2024–2025) as a holdout set. The majority of those cases happened after the o4 training cutoff of May 31, 2024.
Are these 56 cases distinct from all other cases in the data?
Yes. They are about entirely different patient reports.
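
For anyone unfamiliar with the idea, here is a rough sketch of a temporal holdout like the one described above: sort the cases by date, hold out the most recent ones, and check how many fall after the model's training cutoff. The record layout, the case ids, the dates, and the function names are all made up for illustration; this is not the paper's code, and the cutoff is just the date quoted in this thread.

```python
from datetime import date

# Assumption: the training cutoff quoted in the thread (May 31, 2024).
TRAINING_CUTOFF = date(2024, 5, 31)


def temporal_holdout(cases, n_holdout):
    """Return (development, holdout), holding out the n most recent cases."""
    ordered = sorted(cases, key=lambda c: c["published"])
    if n_holdout <= 0:
        return ordered, []
    return ordered[:-n_holdout], ordered[-n_holdout:]


def share_after_cutoff(cases, cutoff=TRAINING_CUTOFF):
    """Fraction of cases dated strictly after the cutoff."""
    if not cases:
        return 0.0
    return sum(c["published"] > cutoff for c in cases) / len(cases)


# Hypothetical toy records, not real case data.
cases = [
    {"id": "case-a", "published": date(2023, 11, 2)},
    {"id": "case-b", "published": date(2024, 3, 14)},
    {"id": "case-c", "published": date(2024, 7, 25)},
    {"id": "case-d", "published": date(2025, 1, 9)},
]

dev, holdout = temporal_holdout(cases, n_holdout=2)

# The two sets must not share any case, otherwise the holdout leaks.
assert not {c["id"] for c in dev} & {c["id"] for c in holdout}

print([c["id"] for c in holdout], share_after_cutoff(holdout))
```

A split like this only guards against contamination to the extent that the held-out cases really postdate the cutoff, which is why the share after the cutoff is worth reporting alongside the scores.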