
	by unstuck3958 911 days ago
It's incredible how accurate the Chatbot Arena Leaderboard [0] is at predicting model performance compared to benchmarks (which can be, and are being, gamed; see all the 7B models on the HF leaderboard).

[0]: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

8 comments

paxys 911 days ago

It's because it isn't "predicting" anything, but rather aggregating user feedback. That is of course going to be closest to judging the subjective "best" model that pleases most people.

It's like asking how evaluating 5 years of someone's performance at work can be better at predicting their competency than their SAT scores.

coder543 911 days ago

But, what if you could make an SAT that is equivalent to evaluating years of performance at work?

https://huggingface.co/papers/2306.05685

This paper makes the argument that...

"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."

So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly determine a predicted Elo rating for every model, which would be interesting to compare against the human-rated outcomes.
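
As a rough illustration of what "agreement" means in that quote, here is a minimal sketch that counts how often a judge and a human pick the same winner over a set of pairwise comparisons. The function name `agreement_rate` and the vote labels are made up for this example, and the votes are invented data, not figures from the paper.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where the judge and the human picked the same winner."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Invented votes for five pairwise comparisons: "A", "B", or "tie".
human = ["A", "B", "A", "tie", "B"]
judge = ["A", "B", "B", "tie", "B"]

print(agreement_rate(judge, human))  # 0.8
```

The paper may handle ties and repeated votes differently; this only shows the basic idea.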

infinityio 911 days ago

My understanding was that GPT-4 evaluation appeared to specifically favour text that GPT-4 itself would generate (leading to some bias towards GPT-based fine-tunes), although I can't remember the details.

coder543 911 days ago

GPT-4 apparently shows a small bias (10%) towards itself in the paper, and GPT-3.5 apparently did not show any measurable bias towards itself.

Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.

I don’t think it would be a replacement for human rating, but it would be interesting to see.
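
A minimal sketch of the "recusal" idea, assuming each comparison knows the names of the two competing models and there is a fixed pool of judges. The function `pick_judge` and the model labels are placeholders for illustration, not a real API.

```python
from typing import Optional, Sequence


def pick_judge(model_a: str, model_b: str, judges: Sequence[str]) -> Optional[str]:
    """Return the first judge that is not one of the two models being compared."""
    for judge in judges:
        if judge not in (model_a, model_b):
            return judge
    return None  # every available judge would be grading its own output


# Placeholder labels for the judge pool (assumption, not real endpoint names).
JUDGES = ["gpt-4", "claude", "gemini-ultra"]

print(pick_judge("gpt-4", "mixtral", JUDGES))  # claude
print(pick_judge("llama", "mixtral", JUDGES))  # gpt-4
```

Matching on names only catches a judge grading itself directly. It would not catch the case raised above, where a fine-tune trained on GPT output is judged by GPT-4.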

coder543 911 days ago

I wish that Arena included a few more "interesting" models like the new Phi-2 model and the current tinyllama model, which are trying to push the limits on small models. Solar-10.7B is another interesting model that seems to be missing, but I just learned about it yesterday, and it seems to have come out a week ago, so maybe it's too new. Solar supposedly outperforms Mixtral-8x7B with a fraction of the total parameters, although Solar seems optimized for single-turn conversation, so maybe it falls apart over multiple messages (I'm not sure).

GaggiX 911 days ago

Solar-10.7B is present in the battle arena but there are probably not enough votes for the ranking.

unstuck3958 910 days ago

> like the new Phi-2 model

Phi-2 isn't fine-tuned for instruction following yet.

s-macke 911 days ago

In fact, the performance differences between the models are so significant that even a micro benchmark demonstrates their capabilities.

For example, consider my analysis [0] based on observing the progression of Large Language Models (LLMs) in a single text adventure.

[0] https://github.com/s-macke/AdventureAI#evaluation-of-other-m...

nabakin 911 days ago

It's much more accurate than the Open LLM Leaderboard, that's for sure. Human evaluation has always been the gold standard. I just wish we could filter out the votes that were made after only one or two prompts, and I hope they don't include the non-blind votes in the results.

GaggiX 911 days ago

These are the rules of the battle arena:

-Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one!

-You can continue chatting until you identify a winner.

-Vote won’t be counted if model identity is revealed during conversation.

nabakin 910 days ago

Perfect ty!

thomasahle 911 days ago

It's not completely blind/anonymous, since you can just ask "What's your name" and the model will identify itself.

Edit: I missed the third rule. I wonder how smart their detection is.

GaggiX 911 days ago

That's why the third rule exists.

coder543 911 days ago

Why filter out the votes made after only one or two prompts? A lot of times, a single response is all you need to see.

Do you really need more than this to know which one you’re going to pick? https://i.imgur.com/En37EJD.png

Avatar doesn’t have humans? Seriously?

nabakin 910 days ago

The thought is, the more a person has used a model, the better they are at evaluating whether or not it is truly worse than another. You can't know if a model is better than another with a sample size of one.

Your test isn't checking for instruction following, consistency, or logic, just one fact that the model you chose may have gotten right by chance. That's fine if you only expect the model to check facts and you don't plan to have a conversation, but if you want more than that, it doesn't work very well.

I'm hoping there are votes in there which can reflect those qualities, and filtering by conversation length seems like the easiest way to improve the vote quality a bit.
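
A minimal sketch of that filter, assuming each vote record carries the number of prompts the voter sent before voting. The `Vote` class and its `user_turns` field are hypothetical; I don't know what schema the Arena's vote logs actually use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vote:
    winner: str      # "model_a", "model_b", or "tie"
    user_turns: int  # prompts the voter sent before voting (hypothetical field)


def filter_by_turns(votes: List[Vote], min_turns: int) -> List[Vote]:
    """Keep only votes cast after at least `min_turns` prompts."""
    return [v for v in votes if v.user_turns >= min_turns]


# Invented votes, for illustration only.
votes = [Vote("model_a", 1), Vote("model_b", 3), Vote("tie", 2), Vote("model_a", 5)]
long_votes = filter_by_turns(votes, min_turns=3)

print(len(long_votes))  # 2
```

The threshold is a judgment call: a higher `min_turns` keeps votes from longer conversations but leaves fewer votes per model.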

londons_explore 911 days ago

The chart in the bottom left corner of that page shows just how far ahead of everyone else the various GPT-4 models are...

_giorgio_ 911 days ago

I've used the Arena a lot, and the differences between models are very clear 90% of the time.

I only ask technical (PyTorch) questions, though.

3abiton 911 days ago

Thanks for the reference. I was searching for a benchmark that can quantify the typical user experience, as most synthetic ones are completely ineffective. At what sample size does the ranking become significant? Or is that baked into the metric (Elo)?

bitshiftfaced 911 days ago

Elo converges on stable scores fairly quickly, depending on the K-factor. I wouldn't think it would be much of an issue at all for something like this, since you can ensure you're testing against every other member (avoiding "Elo islands"). But obviously the more trials the better.

The Glicko rating system is very similar to Elo, but it also models the variance of a given rating. It can directly tell you a "rating deviation."
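
For reference, a minimal sketch of the standard Elo update for one head-to-head result. The K-factor of 32 and the 400-point scale are the conventional chess defaults, used here as assumptions; the Arena may compute its ratings differently.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings; score_a is 1.0 for a win, 0.5 for a tie, 0.0 for a loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Two models start at the same rating and model A wins one battle.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)  # 1016.0 984.0
```

A larger K makes ratings move faster but also makes them noisier. Glicko's rating deviation is not modeled in this sketch.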

moffkalast 911 days ago

It's astounding that Mixtral Instruct ties with 3.5-turbo while being ~10x smaller.

AdrienBrault 911 days ago

3.5-turbo might be 20B, not 10x larger

https://www.reddit.com/r/LocalLLaMA/comments/17jrj82/new_mic...

orbital-decay 911 days ago

Let's see... the linked arXiv article has been withdrawn by the author with the following comment:

> Contains inappropriately sourced conjecture of OpenAI's ChatGPT parameter count from this http URL, a citation which was omitted. The authors do not have direct knowledge or verification of this information, and relied solely on this article, which may lead to public confusion

The URL in question: https://www.forbes.com/sites/forbestechcouncil/2023/02/17/is...

This article was written by Aleks Farseev, the CEO of SoMonitor.ai, who makes the claim with no source or explanation:

> ChatGPT is not just smaller (20 billion vs. 175 billion parameters) and therefore faster than GPT-3

moffkalast 911 days ago

Hmm right, the ~300B figure may have been for the non-turbo 3.5

dannyw 911 days ago

Are you sure it’s 10x smaller? I’d be surprised if OpenAI hasn’t been massively distilling their models.
