| HN Mirror



	by gruez 640 days ago
	> That study very clearly shows 4o, o1-mini and o1-preview's new scores are all within margin error on 4/5 of their new benchmarks.

	Which figure are you referring to? For instance, figure 8a shows a -32.0% accuracy drop when an insignificant change was added to the question. It's unclear how that's "within the margin of error" or "Changing names does not affect the performance of Sota models".


famouswaffles 640 days ago

Table 1 in the Appendix. GSM-No-op is the one benchmark that sees significant drops for those 4 models as well (with preview dropping the least at -17%). No-op adds "seemingly relevant but ultimately inconsequential statements". So "change names, performance drops" is decidedly false for today's state of the art.


gruez 640 days ago

Thanks. I wrongly focused on the headline result of the paper rather than the specific claim in the comment chain about "changing name, different results".


TaylorAlexander 640 days ago

Ah, that’s a good point, thanks for the correction.
