Hacker News
by ldjkfkdsjnv 1022 days ago
I don't trust any benchmarks for any LLM that's not coming from FB, Google, OpenAI, Anthropic, or Microsoft. These models are so dynamic that simple benchmark numbers never tell the whole story of a model's quality. Take, for instance, a recent post by Anyscale claiming their fine-tuning of Llama 2 was competitive with OpenAI's model. The reality is that their fine-tuned model is basically worthless, and was competitive along a single metric on a very narrow, commoditized task. It's a great way to get clicks by posting these metrics, though.
3 comments

They could have easily benchmarked with the Spider SQL test set but they didn’t.

I have a feeling that the more robust models might be the ones that don’t perform best on benchmarks right away.

The community has fine-tuned some really good llama models (much better than llama-chat), but I get what you're saying.

I've been testing the best-performing models on the Hugging Face leaderboard lately. Some of them are really impressive, and others are so bad that I second-guess the prompt format or whether the benchmarked model is actually the same one I'm testing.

Which models were really bad?
I was keeping track of the good ones, and don't have many notes on the bad ones.

I do remember testing "LoKuS" last week and it was quite terrible (sometimes gave completely off-topic answers). It scored as one of the highest 13B models on the leaderboard (~65 average), but appears to be removed now.

This is the goal of HumanEval, correct?