
That's fair enough, but you should test other frontier model types to see if the benchmark makes sense for them.

For example, the shortest path benchmark is largely useless for reasoning models. Since they have the equivalent of scratch paper to work through their answers, the limitation becomes their context length rather than any innate ability to reason.