by ekojs 430 days ago
I think it's most illustrative to see the sample head-to-head (H2H) battles that LMArena released [1]. The outputs of Meta's model are too verbose and too 'yappy' IMO. And looking at the verdicts, it's no wonder people are discounting LMArena rankings.

[1]: https://huggingface.co/spaces/lmarena-ai/Llama-4-Maverick-03...

2 comments

In fairness, 4o was like this until very recently. I suspect it comes from training on chain-of-thought (CoT) data from larger models.
Yep, it’s clear that many wins are due to Llama 4’s lowered refusal rate, which is an effective form of Elo hacking.