Dumb collation of benchmarks that the big labs are essentially training on. Livebench.ai is the industry standard - non contaminated, new questions every few months.
Thanks! Are the scores in some way linear here? As in, if model A is rated at 25 and model B at 50, does that mean I will have half the mistakes with model B? Get answers that are 2x more accurate? Or is it subjective?
It says the best "coding index" is held by Grok 4 and Gemini 2.5 Pro. Give me a break. Nobody uses those models for serious coding. It's dominated by Sonnet 4/Opus 4.1 and GPT-5.