| HN Mirror



	by zamadatix 421 days ago
	I think the only way to be particularly impressed with new leading models lately is to hold the opinion that all of the benchmarks are inaccurate and/or irrelevant, and that it's in vibes/anecdotes where the model is really light years ahead. Otherwise you look at the numbers on e.g. lmarena and see it's claiming a ~16% preference win rate for gpt-3.5-turbo from November of 2023 over this new world-leading model from Google.

2 comments

johnfn 421 days ago

Not sure I follow - Gemini has Elo 1470, GPT-3.5-turbo is 1206, which is an 86% win rate. https://chatgpt.com/share/6841f69d-b2ec-800c-9f8c-3e802ebbc0...


zamadatix 420 days ago

gpt-3.5-turbo-1106 from November 2023 was 1170, 1206 is for the March variant.

Change that and you get ~84%; flip the order and the win rate of GPT-3.5 is ~16%. The point is that a two-year-old model still wins far too often to be excited about each new top model from the last two years, not that the two-year-old model is better.
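
For anyone who wants to check the arithmetic in this exchange, here is a minimal sketch. It assumes the leaderboard scores behave like standard Elo ratings (logistic curve, base 10, scale 400); that is an assumption about lmarena's scoring, not something stated in the thread. The function name `elo_expected_score` is made up for illustration. Under that assumption the 264-point gap works out to roughly 82% and the 300-point gap to roughly 85%, in the same ballpark as the figures quoted above.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected win probability of A over B under the standard Elo logistic curve.

    Assumption: ratings use base 10 and a 400-point scale, as in classic Elo.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the thread.
gemini = 1470
gpt35_march = 1206   # the March variant
gpt35_1106 = 1170    # gpt-3.5-turbo-1106 from November 2023

p_vs_march = elo_expected_score(gemini, gpt35_march)
p_vs_1106 = elo_expected_score(gemini, gpt35_1106)

print(f"Gemini vs March variant:   {p_vs_march:.3f}")
print(f"Gemini vs 1106 variant:    {p_vs_1106:.3f}")
# Flipping the order gives the older model's expected win rate.
print(f"1106 variant vs Gemini:    {elo_expected_score(gpt35_1106, gemini):.3f}")
```

The two directions always sum to 1, which is why "flip the order" turns the newer model's win rate into the older model's.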


Workaccount2 421 days ago

People can ask whatever they want on LMarena, so a question like "List some good snacks to bring to work" might elicit a win for an old/tiny/deprecated model simply because it lists the snack the user liked more.


AstroBen 421 days ago

Are you saying that's a bad way to judge a model? Not sure why we'd want ones that choose bad snacks.
