by Jackson__ 1178 days ago
>* According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.

I am so sick of seeing these ridiculous claims made about finetuned versions of llama, with 0 scientific rigor behind them.

This is, I believe, the 3rd llama finetune I've seen posted within the past 2 weeks, all of which claim "similar to ChatGPT" quality, while not actually running it through a _single_ of the many openly available language model benchmarks.

5 comments

On one hand, these are basically student projects, so we shouldn’t be so critical. OTOH, they’re being branded and marketed like products, so their claims deserve scrutiny.
There are certainly some effective language model benchmarks; however, they are not well-suited for evaluating a chat assistant. Some projects employ human evaluation, while this blog post explores an alternative approach based on GPT-4. Both methods have their advantages and disadvantages, making this blog post an intriguing case study that can inspire the future development of more comprehensive evaluations.
I don't think there are any benchmarks for chat models. You could just do the usual lambada, etc., but what's the point? We already know the scores for llama and that RLHF doesn't meaningfully improve capabilities.
Dolly from Databricks is another example. They released the weights/model on Hugging Face, and I can run generations with it on my M1 MacBook, but it’s very slow.
Then do you have a better way to rigorously evaluate a chatbot, given that LLMs like ChatGPT are trained on almost all Internet data?
> running it through a _single_ of the many openly available language model benchmarks.
But how could you guarantee the pretrained model hasn't seen those benchmarks? And that the baselines you are comparing to (ChatGPT, Bard) haven't either? Because those benchmark datasets are also collected from the Internet, right?
As a baseline, don't! If it performs horribly on the test and it cheated, that's even worse than if it fails the test and didn't cheat. So the benchmark score gives you an upper bound on performance.
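
To make the upper-bound argument concrete, here is a minimal sketch in Python. It assumes that seeing a benchmark item during training can only turn a wrong answer into a right one, never the reverse. The function name `clean_accuracy_bounds`, the `suspected_leaked` flags, and the toy data are all hypothetical; nothing here comes from a real benchmark run.

```python
def clean_accuracy_bounds(correct, suspected_leaked):
    """Bound the accuracy a model would have without benchmark leakage.

    correct[i]          -- True if the model answered item i correctly
    suspected_leaked[i] -- True if item i may have been in the training data

    Assumption (not a fact about any real model): leakage never hurts,
    so the observed score is an upper bound on the leak-free score.
    """
    if len(correct) != len(suspected_leaked):
        raise ValueError("inputs must have the same length")
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one benchmark item")

    # Upper bound: the score as measured, leaked items included.
    upper = sum(correct) / n

    # Lower bound: pessimistically count every correct answer on a
    # suspected-leaked item as memorized, i.e. as a miss.
    earned = sum(c and not leaked for c, leaked in zip(correct, suspected_leaked))
    lower = earned / n
    return lower, upper


# Toy data, invented purely for illustration.
correct = [True, True, False, True, False, True, True, False]
suspected_leaked = [True, False, False, True, False, False, True, False]

lower, upper = clean_accuracy_bounds(correct, suspected_leaked)
print(f"leak-free accuracy is somewhere in [{lower}, {upper}]")
```

If even the upper bound is bad, the model is bad whether or not it cheated, which is the point: a poor benchmark score is informative even when contamination can't be ruled out.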