by ssivark 2 days ago
I don't specifically care about Claude -vs- GPT, but comparing models without controlling for the amount of test-time compute is a gaping hole. It also means that any unreasonably expensive, token-guzzling white-elephant model can top all the benchmarks and still be useless.

What we actually have is something like a scaling law for test-time compute, so it's silly to focus on specific Y values that someone benchmarked (at whatever default X values). Instead, characterize the slope or power of the scaling law, or just plot the damn curve for each model -vs- number of tokens or cost or something!
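To make "characterize the slope or power" concrete, here is a rough sketch, assuming the score really does follow a power law, score ≈ a * tokens^b, over the measured range (scores that saturate near a benchmark ceiling won't fit this form well). The function name `fit_power_law` and all the numbers are made up for illustration; they are not from any real benchmark.

```python
import math

def fit_power_law(tokens, scores):
    """Least-squares fit of score = a * tokens**b in log-log space.

    Returns (a, b). The exponent b is the slope of the line on a
    log-log plot, i.e. how fast the score grows with test-time compute.
    """
    xs = [math.log(t) for t in tokens]
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from log space
    return a, b

# Hypothetical (invented) token budgets and scores for two models.
token_budgets = [1_000, 4_000, 16_000, 64_000]
runs = {
    "model_a": [0.30, 0.36, 0.43, 0.52],
    "model_b": [0.35, 0.38, 0.41, 0.45],
}

# One exponent per model instead of one bar per model.
exponents = {name: fit_power_law(token_budgets, scores)[1]
             for name, scores in runs.items()}
```

With data like this, comparing `exponents` (or just plotting each model's scores against `token_budgets` on log axes) says more than a single score taken at whatever the default budget happened to be.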

Noam Brown also raised this issue recently: https://x.com/polynoamial/status/2064210146558136827


ok i mean i agree, but how is it a gaping hole when it's literally the second (and third and fourth...) chart on the post? yes, token cost and reasoning efficiency are important, hence the 2D pareto charts
My apologies... I was responding to the above comment / ranting about the general trend and got carried away. It wasn't directed specifically at your post.

I love your second graph; hope the trend catches on as the main graph, instead of the model-wise bar graph that seems to be popular.

1 dimension is unfortunately all the mental bandwidth that talking heads have.