Yes, those two models were tested on my own PC (local inference on my own CPU/GPU), so something may be bugged on my setup. gemma4-26b should be far better than gemma4-e4b.
Sounds like you may be using a worse quantization on the bigger model? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't already specified in a benchmark, it probably should be.
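As a rough illustration of why bit width matters, here is a minimal sketch that quantizes random weights to 8 and 4 bits and compares the round-trip error. It assumes a toy symmetric round-to-nearest scheme with a single scale for the whole tensor; the function names are made up for this example, and real formats such as the Q8/Q4 variants used for local inference work block-wise with per-block scales, so the numbers only show the trend.

```python
import math
import random


def quantize_roundtrip(weights, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then dequantize.

    Toy scheme (an assumption for illustration), not any real model file format.
    """
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def rms_error(original, restored):
    """Root-mean-square difference between two equally long lists."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return math.sqrt(total / len(original))


random.seed(0)
# Stand-in for a weight tensor: 4096 samples from a standard normal.
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]

errors = {
    bits: rms_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 4)
}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.4f}")
```

The 4-bit error comes out well above the 8-bit error here, which is the reason a benchmark comparing local models should state the quantization used for each one.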
A junior tinkering in their garage in domains where they have little experience executed a flawed test and decided to call it a benchmark. It's extremely common nowadays because words don't mean anything anymore. The forums that used to be filled with technical people doing real work are now filled with the masses of vibe researchers doing this kind of stuff. This is what happens when anything goes over some popularity threshold.
HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.
You're right, I've certainly been a bit presumptuous to call this 'a benchmark'. It is indeed a flawed test. Yet it has given me the chance to try some open source models, and for my workflow some of them are incredibly competitive with SOTA closed source models.
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.
I haven't evaluated the judge benchmark. You have everything needed in the repo to do so though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.
BTW, if you explore the repo, sorry for all the French files...
That’s the thing, not everyone values a model based on that. But I guess it works for you, and that benchmark achieves it.
I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.
I found 5.4/5.5 much better at following the spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me and worse for you.
Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.
What strikes me as very strange, though, is that zero models were able to just use the search input already present on the Gravity Forms forms list page; all of them created a second input.
Also, I know it's not in the prompt, but adding a Ctrl+F shortcut to a search input? Is that so crazy? I don't know.