HN Mirror



	by nabakin 911 days ago
	It's much more accurate than the Open LLM Leaderboard, that's for sure. Human evaluation has always been the gold standard. I just wish we could filter out the votes that were made after only one or two prompts, and I hope they don't include the non-blind votes in the results.


GaggiX 911 days ago

These are the rules of the battle arena:

-Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!

-You can continue chatting until you identify a winner.

-Vote won’t be counted if model identity is revealed during conversation.


nabakin 911 days ago

Perfect ty!


thomasahle 911 days ago

It's not completely blind/anonymous, since you can just ask "What's your name" and the model will identify itself.

Edit: I missed the third rule. I wonder how smart their detection is.


GaggiX 911 days ago

That's why the third rule exists.


coder543 911 days ago

Why filter out the votes made after only one or two prompts? A lot of times, a single response is all you need to see.

Do you really need more than this to know which one you’re going to pick? https://i.imgur.com/En37EJD.png

Avatar doesn’t have humans? Seriously?


nabakin 911 days ago

The thought is, the more a person has used a model, the better they are at evaluating whether or not it is truly worse than another. You can't know if a model is better than another with a sample size of one.

Your test isn't checking instruction following, consistency, or logic, just one fact, which the model you chose may have gotten right by chance. That's fine if you only expect the model to check facts and you don't plan to have a conversation, but if you want more than that, it doesn't work very well.

I'm hoping there are votes in there which can reflect those qualities and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
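
A rough sketch of the kind of filter I have in mind, in Python. The `Vote` fields and the `min_prompts` threshold are made up for illustration; I don't know what the arena's actual vote data looks like.

```python
from dataclasses import dataclass


@dataclass
class Vote:
    # Hypothetical record layout, not the arena's real schema.
    winner: str              # "model_a", "model_b", or "tie"
    num_prompts: int         # user prompts sent before the vote was cast
    identity_revealed: bool  # True if a model named itself during the chat


def filter_votes(votes, min_prompts=3):
    """Keep only blind votes cast after at least `min_prompts` prompts."""
    return [
        v for v in votes
        if not v.identity_revealed and v.num_prompts >= min_prompts
    ]


votes = [
    Vote("model_a", 1, False),  # dropped: only one prompt
    Vote("model_b", 4, False),  # kept: blind and long enough
    Vote("model_a", 5, True),   # dropped: identity was revealed
]
kept = filter_votes(votes, min_prompts=3)
print(f"{len(kept)} of {len(votes)} votes kept")
```

The threshold is arbitrary. The point is only that conversation length is cheap to filter on compared with judging vote quality directly.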
