| HN Mirror


	by guilamu 54 days ago
	Just tested it on my homemade WordPress+GravityForms benchmark and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark I know it's only a single benchmark, but I don't understand how it can be so bad...

5 comments

goldenarm 54 days ago

gemma4-e4b is 50% better than gemma4-26b in your benchmark, something's wrong

guilamu 54 days ago

Yes, those two models were tested on my own PC (local inference using my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.

embedding-shape 54 days ago

Sounds like maybe a worse quantization was used on the bigger model? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't already specified in a benchmark, it probably should be.

data-ottawa 54 days ago

The early quants for Gemma4 26b had issues and needed to be updated; it might be worth checking.

Art9681 54 days ago

A junior tinkering in their garage, in domains where they have little experience, executed a flawed test and decided to call it a benchmark. It's extremely common nowadays because words don't mean anything anymore. The forums that used to be filled with technical people doing real work are now filled with the masses of vibe researchers doing this kind of stuff. This is what happens when anything goes over some popularity threshold.

HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.

guilamu 52 days ago

You're right, I've certainly been a bit presumptuous to call this a 'benchmark'. It is indeed a flawed test. Yet it has given me the chance to try some open source models, and for my workflow, some of them are incredibly competitive with SOTA closed source models.

ac29 54 days ago

Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.

guilamu 54 days ago

Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning), according to the Gemini 3.1 Pro evaluation.

ac29 54 days ago

Your table doesn't indicate reasoning vs. non-reasoning, or the reasoning level.

guilamu 54 days ago

When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, if available).

The models not available on Copilot were tested through opencode (max reasoning), and deepseek v4 was tested through Cline (with max reasoning too).
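
For anyone setting up a similar benchmark, here is a minimal sketch of how the per-run settings discussed above (reasoning level, harness, and quantization for local runs) could be stored next to each score so the table can show them. The `RunConfig` class, its field names, and the `label` helper are hypothetical and not taken from the repo; the local model entry and its quantization value are made-up examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of how one benchmark run was produced."""
    model: str
    harness: str                        # tool the model was driven through
    reasoning: str = "max"              # default mirrors "when nothing is noted it's max"
    quantization: Optional[str] = None  # only meaningful for local inference


def label(cfg: RunConfig) -> str:
    """Build a leaderboard label that makes the run settings explicit."""
    parts = [cfg.model, f"reasoning={cfg.reasoning}", f"via {cfg.harness}"]
    if cfg.quantization is not None:
        parts.append(f"quant={cfg.quantization}")
    return " | ".join(parts)


runs = [
    # From the thread: deepseek v4 was run through Cline with max reasoning.
    RunConfig("deepseek v4", "cline"),
    # Made-up local run, only to show the quantization field in use.
    RunConfig("example-local-model", "local", reasoning="none", quantization="Q4_K_M"),
]

for run in runs:
    print(label(run))
```

With something like this, a local run with a different quantization gets a different label instead of being silently compared against the others.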

mosselman 54 days ago

You even traveled in time to deliver us this benchmark.

I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to set up my own similar benchmark.

guilamu 54 days ago

Haha, just fixed the date!

I haven't evaluated the judge benchmark. You have everything needed in the repo to do so though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.

BTW, if you explore the repo, sorry for all the French files...

DrProtic 54 days ago

Seems like a benchmark for how good a model is for vibe coding.

Your prompt is extremely slim, yet you score it on a bunch of features.

guilamu 54 days ago

Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own".

The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...

DrProtic 54 days ago

That’s the thing, not everyone wants and values the model based on that. But I guess it works for you, and that benchmark achieves it.

I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.

I found 5.4/5.5 much better at following the spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me while worse for you.

guilamu 54 days ago

Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.

What strikes me as very strange, though, is that zero models were able to just use the search input already present on the GravityForms forms list page; they all created a second input.

Also, I know it's not in the prompt, but adding a ctrl+f shortcut to a search input? Is that that crazy? I don't know.
