


	by paxys 919 days ago
	It's because it isn't "predicting" anything, but rather aggregating user feedback. That is of course going to come closest to identifying the subjective "best" model, the one that pleases the most people. It's like asking how evaluating 5 years of someone's performance at work could be better at predicting their competency than their SAT scores.


coder543 919 days ago

But what if you could make an SAT that is equivalent to evaluating years of performance at work?

https://huggingface.co/papers/2306.05685

This paper makes the argument that...

"Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain."

So, the Arena could theoretically be automated and achieve similar outcomes. Or at least, it could quickly determine a predicted Elo rating for every model, which would be interesting to compare against the human-rated outcomes.
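As a rough illustration of the predicted-Elo idea, here is a minimal sketch of turning a stream of pairwise judge verdicts into ratings with the standard Elo update. The function names, the K-factor of 32, the starting rating of 1000, and the `(model_a, model_b, outcome)` tuple format are all assumptions made for this example; they are not taken from the paper or from the Arena's actual implementation, and the verdicts would have to come from whatever LLM judge you wire up.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_judgments(judgments, k: float = 32.0, initial: float = 1000.0):
    """Compute Elo ratings from pairwise verdicts.

    `judgments` is an iterable of (model_a, model_b, outcome) tuples, where
    outcome is 1.0 if the judge preferred A, 0.0 if it preferred B, and 0.5
    for a tie. This tuple format is hypothetical, chosen for the sketch.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in judgments:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] = r_a + k * (outcome - e_a)
        ratings[model_b] = r_b - k * (outcome - e_a)
    return dict(ratings)


if __name__ == "__main__":
    # Made-up verdicts, purely to show the call shape.
    verdicts = [
        ("model-x", "model-y", 1.0),
        ("model-y", "model-z", 0.5),
        ("model-x", "model-z", 1.0),
    ]
    for name, rating in sorted(elo_from_judgments(verdicts).items()):
        print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order of the comparisons, so a real comparison against human-rated results would probably want to shuffle or average over several orderings.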


infinityio 919 days ago

My understanding was that GPT-4 evaluation appeared to specifically favour text that GPT-4 would generate itself (leading to some bias towards GPT-based fine-tunes), although I can't remember the details.


coder543 919 days ago

In the paper, GPT-4 apparently shows a small bias (10%) towards itself, and GPT-3.5 apparently shows no measurable bias towards itself.

Given the possibility of bias, it would make sense to have the judge “recuse” itself from comparisons involving its own output. Between GPT-4, Claude, and soon Gemini Ultra, there should be several strong LLMs to choose from.
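A minimal sketch of what "recusal" could look like in code, assuming each model is tagged with a family so that fine-tunes of a judge's base model also trigger recusal. The `pick_judges` function, the `family_of` mapping, and the model names are hypothetical and exist only for this example.

```python
def pick_judges(judges, model_a, model_b, family_of):
    """Return the judges allowed to rate a comparison between two models.

    A judge is excluded when its family matches the family of either
    contestant. `family_of` maps a model name to a family label; models
    missing from the mapping are treated as their own family.
    """
    excluded = {
        family_of.get(model_a, model_a),
        family_of.get(model_b, model_b),
    }
    return [j for j in judges if family_of.get(j, j) not in excluded]


if __name__ == "__main__":
    # Hypothetical family labels, chosen for illustration only.
    family_of = {
        "gpt-4": "gpt",
        "gpt-3.5-finetune": "gpt",
        "claude": "claude",
        "gemini-ultra": "gemini",
    }
    judges = ["gpt-4", "claude", "gemini-ultra"]
    print(pick_judges(judges, "gpt-3.5-finetune", "some-open-model", family_of))
```

If every available judge ends up recused, the function returns an empty list, and the caller would need to decide whether to skip that comparison or fall back to human raters.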

I don’t think it would be a replacement for human rating, but it would be interesting to see.
