by stavros 927 days ago
Am I reading it right that performance was roughly comparable with GPT-3.5? How is this even possible?
4 comments

Not really. They already chose to show the benchmark where it does best, and even then it’s still quite a bit worse (though definitely impressive for its size). If you look at other benchmarks, for example 5-shot MMLU, this model scores 46.3 while GPT-3.5 scores 70.

But there might be some use cases where this one is close enough in performance and the difference in cost and speed makes it a better choice.
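
For a rough sense of how far apart the two MMLU numbers quoted above are, here is a minimal back-of-the-envelope sketch. The helper name `score_gap` is made up for illustration; the only inputs are the 46.3 and 70 figures from the comment.

```python
def score_gap(small: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gap in points, small model's score as a fraction of the baseline)."""
    return baseline - small, small / baseline


# Scores as quoted in the comment: 5-shot MMLU, small model vs. GPT-3.5.
gap, fraction = score_gap(46.3, 70.0)
print(f"gap: {gap:.1f} points, fraction of baseline: {fraction:.0%}")
```

So on this benchmark the smaller model lands at roughly two-thirds of the GPT-3.5 score, which is why "close enough" depends heavily on the use case.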

No it's not (according to their benchmarks).

Zephyr-7B-B still beats it in most benchmarks but it's close.

This model is almost Zephyr-7B-B performance at 3B size which is a lot better for inference requirements.

Yeah, it's got a way to go to beat 3.5, but it beats most of the first-generation LLaMA tunes, even Guanaco 65B.

Lots of improvements to go

Benchmarks are either limited, or have data leaks, or in most cases just don't make sense in terms of usability, so I've personally stopped looking at them to compare models. If I want to try a new model I hear a lot of chatter about, I use it for a few hours in my daily workflow. My baselines are GPT-3.5 and GPT-4, and I compare new models against them in my day-to-day usage.
So in your experience which open model is currently the best?
The LLM field at large is still messy. If you look at the rankings of model performance, they still do not reflect usability in real life. I think one major challenge is finding a benchmark that does.