
Isn't it worth running the benchmark on that anyway, to prove or disprove this? It would send a strong signal to Google to get their act together, and save me from wasting tokens by selecting high.

Out of curiosity, how are benchmark runs generally funded? It would obviously be great to test them all at every reasoning level, both in and out of their native harnesses. Maybe even in pi / opencode / Cursor, but I get that this would become prohibitively expensive unless you have funding or free tokens.

Thanks for your efforts thus far. Looking forward to seeing more.