by JacobAsmuth 16 days ago
I wonder why they didn't test Gemini 3.5 Flash (High).

In small-scale testing we found that high effort on Gemini 3.5 Flash caused it to overthink, generating large numbers of tokens without a substantive improvement in performance.
Is it not worth running the benchmark on that anyway, to prove or disprove this? It would send a strong signal to Google to get their act together, and save me from wasting tokens by selecting high.

Out of curiosity, how are benchmark runs generally funded? It would obviously be great to test all the models at all reasoning levels, both in and out of their native harnesses, maybe even in pi / opencode / cursor, but I get that this would become prohibitively expensive unless you have funding or free tokens.

Thanks for your efforts thus far. Looking forward to seeing more.