HN Mirror



	by stavros 927 days ago
	Am I reading it right that performance was roughly comparable with GPT-3.5? How is this even possible?

4 comments

Version467 927 days ago

Not really. They already chose to show the benchmark where it does best, and even then it’s still quite a bit worse (though definitely impressive for its size). If you look at other benchmarks, for example 5-shot MMLU, this model scores 46.3 while GPT-3.5 scores 70.

But there might be some use cases where this one is close enough in performance and the difference in cost and speed makes it a better choice.


filterfiber 927 days ago

No it's not (according to their benchmarks).

Zephyr-7B-B still beats it in most benchmarks but it's close.

This model is almost Zephyr-7B-B performance at 3B size which is a lot better for inference requirements.


emadm 927 days ago

Yeah, it's got a way to go to beat 3.5, but it beats most of the first-generation Llama tunes, even Guanaco 65B.

Lots of improvements to go


alsodumb 927 days ago

Benchmarks are either limited, have data leaks, or in most cases just don't make sense in terms of usability, so I've personally stopped looking at them to compare models. If I want to try a new model I hear a lot of chatter about, I use it for a few hours in my daily workflow. My baselines are GPT-3.5 and GPT-4, and I compare new models against them in my day-to-day usage.


kouteiheika 927 days ago

So in your experience which open model is currently the best?


3abiton 926 days ago

The LLM field at large is still messy. If you look at the rankings of model performance, they still do not reflect usability in real life. I think one major challenge is finding a benchmark that does.
