Hacker News
by ammar_x 22 days ago
Absolutely! We need new and better benchmarks like this.

I have a question: why not use the maximum available reasoning on each LLM? For example, I see that Opus 4.7 is at `max` reasoning but Sonnet 4.6 is at `high`. Wouldn't it be a fairer comparison if all were at max?