| HN Mirror


by ssivark 49 days ago

I don't specifically care about Claude -vs- GPT, but comparing models at different amounts of test time compute is a gaping hole. It also means that any unreasonably expensive, token-guzzling white-elephant model can top all the benchmarks and still be useless.

What we actually have is like a scaling law for test time compute, so it's silly to focus on specific Y values that someone benchmarked (at whatever default X values). Instead, characterize the slope or power of the scaling law, or just plot the damn curve for each model -vs- number of tokens or cost or something!
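A minimal sketch of what "characterize the slope or power" could look like, assuming score follows roughly a power law in tokens (score ≈ a * tokens^b) over the measured range. The function name `fit_power_law` and the numbers are made up for illustration; they are not from any real benchmark, and a bounded metric like accuracy will only follow a power law approximately.

```python
import math

def fit_power_law(tokens, scores):
    """Fit score ≈ a * tokens**b by ordinary least squares in log-log space.

    Returns (a, b); b is the slope of the curve on a log-log plot.
    """
    xs = [math.log(t) for t in tokens]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # exponent (log-log slope)
    a = math.exp(mean_y - b * mean_x)  # prefactor
    return a, b

# Hypothetical (tokens, score) measurements for one model; illustrative only.
tokens = [1_000, 4_000, 16_000, 64_000]
scores = [0.20, 0.28, 0.39, 0.55]

a, b = fit_power_law(tokens, scores)
print(f"score ≈ {a:.4f} * tokens^{b:.3f}")
```

Comparing the fitted exponent `b` across models says how much each one gains per extra unit of test time compute, instead of comparing single Y values taken at whatever X each model happened to default to.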

Noam Brown also raised this issue recently: https://x.com/polynoamial/status/2064210146558136827


swyx 49 days ago

ok i mean i agree, but how is it a gaping hole when it's literally the second (and third and fourth...) chart on the post? yes, token cost and reasoning efficiency are important, hence the 2D pareto charts
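For readers unfamiliar with the term, a rough sketch of how the frontier in a 2D Pareto chart can be computed, assuming each model is summarized as one (cost, score) point where lower cost and higher score are better. The function name `pareto_frontier` and the model names and numbers are hypothetical, not taken from the post.

```python
def pareto_frontier(points):
    """Return the non-dominated points from a list of (name, cost, score).

    A point is dominated if another point is at least as cheap and at least
    as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:  # beats everything cheaper or equally cheap
            frontier.append((name, cost, score))
            best_score = score
    return frontier

# Hypothetical models: (name, cost per task in dollars, benchmark score).
models = [
    ("model-a", 0.10, 0.52),
    ("model-b", 0.40, 0.48),  # costs more than model-a and scores lower
    ("model-c", 1.50, 0.71),
    ("model-d", 9.00, 0.73),  # on the frontier, but at a steep price
]

for name, cost, score in pareto_frontier(models):
    print(name, cost, score)
```

A model can sit on the frontier and still be a poor deal, which is why the whole curve is more informative than a one-dimensional ranking.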


ssivark 49 days ago

My apologies... I was responding to the above comment / ranting about the general trend and got carried away. It wasn't directed specifically at your post.

I love your second graph; hope the trend catches on as the main graph, instead of the model-wise bar graph that seems to be popular.


swyx 49 days ago

1 dimension is unfortunately all the mental bandwidth that talking heads have.
