


	by amrrs 53 days ago
	Honestly, the problem with these is how anecdotal they are: how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!

3 comments

minimaxir 53 days ago

In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and the result passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and those have led to much better performance (3x in some cases).

A more rigorous empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal of implementing an algorithm and making it as fast as possible, then quantify the relative speed improvements among the implementations that pass all test cases).
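A rough sketch of what such a harness could look like in Python, assuming each agent's output can be wrapped as a plain callable. Everything here is hypothetical and for illustration only: `relative_speedups`, the test-case format, and the toy `baseline_sum` / `closed_form_sum` / `broken_sum` functions stand in for real agent submissions. It only counts a speedup if the candidate passes every test case first.

```python
import timeit
from typing import Callable, Dict, List, Optional, Tuple

# A test case is (positional args, expected result). Hypothetical format.
TestCase = Tuple[tuple, object]


def passes_all(fn: Callable, cases: List[TestCase]) -> bool:
    """Return True only if fn gives the expected result on every case."""
    return all(fn(*args) == expected for args, expected in cases)


def best_time(fn: Callable, cases: List[TestCase],
              repeat: int = 5, number: int = 100) -> float:
    """Best-of-`repeat` wall time for running fn over all cases `number` times."""
    def run() -> None:
        for args, _ in cases:
            fn(*args)
    return min(timeit.repeat(run, repeat=repeat, number=number))


def relative_speedups(baseline: Callable,
                      candidates: Dict[str, Callable],
                      cases: List[TestCase]) -> Dict[str, Optional[float]]:
    """Speedup of each candidate over the baseline; None if it fails a test."""
    if not passes_all(baseline, cases):
        raise ValueError("baseline fails its own test cases")
    base = best_time(baseline, cases)
    results: Dict[str, Optional[float]] = {}
    for name, fn in candidates.items():
        if passes_all(fn, cases):
            results[name] = base / best_time(fn, cases)
        else:
            results[name] = None  # a fast but wrong answer does not count
    return results


# Toy stand-ins for "agent" implementations (assumptions, not real submissions).
def baseline_sum(n: int) -> int:
    total = 0
    for i in range(n + 1):
        total += i
    return total


def closed_form_sum(n: int) -> int:
    return n * (n + 1) // 2


def broken_sum(n: int) -> int:
    return n * n // 2  # wrong on purpose, should be rejected


CASES: List[TestCase] = [((0,), 0), ((10,), 55), ((1000,), 500500)]
```

Calling `relative_speedups(baseline_sum, {"closed_form": closed_form_sum, "broken": broken_sum}, CASES)` gives a speedup ratio for `closed_form` and `None` for `broken`. The actual ratio depends on the machine, which is why the "equal hardware" part matters.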


squibonpig 53 days ago

Yeah but like what if they're sorta embellishing it or just lying? That's the issue with not being reproducible.


theptip 52 days ago

The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.

OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper.

It’s an engineering result, not a scientific one.


jstanley 53 days ago

Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...


girvo 53 days ago

That's easily explained by those being two different people with two different opinions?


2goomba1stage 52 days ago

And together they make one single community that's effectively NEVER happy.
