HN Mirror

by Jackson__ 1178 days ago

>* According to a fun and non-scientific evaluation with GPT-4. Further rigorous evaluation is needed.

I am so sick of seeing these ridiculous claims made about finetuned versions of llama, with 0 scientific rigor behind them.

This is, I believe, the 3rd llama finetune I've seen posted within the past 2 weeks, all of which claim "similar to ChatGPT" quality while not actually running the model through a _single_ of the many openly available language model benchmarks.

5 comments

kristjansson 1178 days ago

On one hand, these are basically student projects, so we shouldn’t be so critical. OTOH, they’re being branded and marketed like products, so their claims deserve scrutiny.

MMMercy2 1178 days ago

There are certainly some effective language model benchmarks; however, they are not well-suited for evaluating a chat assistant. Some projects employ human evaluation, while this blog post explores an alternative approach based on GPT-4. Both methods have their advantages and disadvantages, making this blog post an intriguing case study that can inspire the future development of more comprehensive evaluations.

ImprobableTruth 1178 days ago

I don't think there are any benchmarks for chat models. You could just do the usual lambada, etc., but what's the point? We already know the scores for llama and that RLHF doesn't meaningfully improve capabilities.

d4rkp4ttern 1178 days ago

Dolly from Databricks is another example. They released the weights/model on Hugging Face, and I can run generations with it on my M1 MacBook, but it’s very slow.

zhisbug 1178 days ago

Then do you have a better way to rigorously evaluate a chatbot in the presence of LLMs like ChatGPT that are trained on almost all Internet data?

cscurmudgeon 1178 days ago

> running it through a _single_ of the many openly available language model benchmarks.

zhisbug 1178 days ago

But how could you guarantee that the pretrained model hasn't seen those benchmarks? And that the baselines you are comparing against (ChatGPT, Bard) haven't either? Because those benchmark datasets are also collected from the Internet, right?

CGamesPlay 1178 days ago

As a baseline, don't! If it performs horribly on the test and it cheated, that's even worse than if it fails the test and didn't cheat. So the benchmark score gives you an upper bound on performance.
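
To make the upper-bound argument concrete, here is a toy sketch in Python. It rests on one loud assumption: seeing a benchmark item during training can only help, so a leaked item is always answered correctly and a non-leaked item is answered exactly as the model would have answered it anyway. The function names and the sample data are made up for illustration and do not come from any real benchmark.

```python
def honest_score(honest_correct):
    """Accuracy the model would get with no leaked benchmark items."""
    return sum(honest_correct) / len(honest_correct)


def observed_score(honest_correct, leaked):
    """Accuracy actually measured on the benchmark.

    Assumption (hypothetical model of contamination): a leaked item is
    always answered correctly; every other item keeps its honest outcome.
    """
    hits = [h or l for h, l in zip(honest_correct, leaked)]
    return sum(hits) / len(hits)


# Made-up outcomes for five benchmark items.
honest = [True, False, False, True, False]   # what the model really knows
leaked = [False, True, False, False, True]   # items seen during training

measured = observed_score(honest, leaked)
real = honest_score(honest)

# Under the assumption above, the measured score can never fall below
# the honest one, so a poor measured score implies a poor honest score.
print(f"measured={measured:.1f} honest={real:.1f}")
```

Under this model the measured number can overstate the model but never understate it, which is why a bad score is still informative even when contamination cannot be ruled out.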
