Hacker News new | ask | show | jobs
by noddybear 456 days ago
This is true - there are simpler benchmarks that can saturate planning for these models. We were motivated to create a broader spectrum eval, to test multiple capabilities at once and remain viable into the future.
1 comments

That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.

For example the shortest path benchmark is largely useless when you look at reasoning models - since they have the equivalent of scratch paper to work through their answers the limitation became their context length rather than any innate ability to reason.