HN Mirror



	by miraculixx 358 days ago
	As any AI researcher knows, if a model does 4x better than the naive baseline (the humans, in this case), you are likely looking at overfitting, not real-life performance. This study is just slop, and you can tell by the mere fact that they did not submit a paper and only published a PR article.

2 comments

LargoLasskhyfv 358 days ago

They didn't? What am I looking at, then?

https://arxiv.org/abs/2506.22405

This appears when you click 'View Publication' near the end of the article, right before the Q&A.

brandonb 358 days ago

In the paper, they say they used the most recent 56 cases (from 2024–2025) as a holdout set. The majority of those cases happened after the o4 training cutoff of May 31, 2024.
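
For anyone unfamiliar with this kind of split, here is a minimal sketch of a temporal holdout: sort the cases by date, keep the most recent N for evaluation, and count how many of those postdate the model's training cutoff. The `temporal_holdout` function, the `published` field, and the four example cases are all hypothetical and synthetic; this is not the paper's code or data.

```python
from datetime import date

def temporal_holdout(cases, n_holdout, cutoff):
    """Hold out the n_holdout most recent cases and count those dated after cutoff.

    `cases` is a list of dicts with a hypothetical `published` date field.
    """
    if not 0 < n_holdout <= len(cases):
        raise ValueError("n_holdout must be between 1 and len(cases)")
    ordered = sorted(cases, key=lambda c: c["published"])
    train, holdout = ordered[:-n_holdout], ordered[-n_holdout:]
    # Cases published after the cutoff cannot have appeared in training data.
    after_cutoff = sum(1 for c in holdout if c["published"] > cutoff)
    return train, holdout, after_cutoff

# Synthetic example data, for illustration only.
cases = [
    {"id": "case-1", "published": date(2023, 11, 2)},
    {"id": "case-2", "published": date(2024, 3, 14)},
    {"id": "case-3", "published": date(2024, 8, 1)},
    {"id": "case-4", "published": date(2025, 1, 9)},
]

train, holdout, after = temporal_holdout(
    cases, n_holdout=3, cutoff=date(2024, 5, 31)
)
print(len(train), len(holdout), after)
```

A holdout case dated before the cutoff could still have leaked into training, which is why the share of cases after the cutoff is the number to look at.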

miraculixx 358 days ago

Are these 56 cases distinct from all other cases in the data?

FlyingLawnmower 358 days ago

Yes. They cover entirely different patient reports.
