| HN Mirror


by jxy 793 days ago

This inductive logic is way overblown.

> Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.

Judging by a single benchmark? Without even trying it out with real world usage?

> And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Any potential caveats of such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 models.

By the way, for beating benchmarks, "Pretraining on the Test Set Is All You Need"

1 comments

oersted 793 days ago

It's easy to miss: select English in the dropdown. The scores are quite different in Overall and in English for LMSYS.

As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.

I'm sure that when things are properly measured by third parties, they will show a more sober picture. But still, with good fine-tunes, we'll probably get close.

It's a very significant demonstration of what could be possible soon.


saretup 793 days ago

Firstly, English is a highly subjective category.

Secondly, Llama 3 usually adds first sentences like ‘What a unique question!’ or ‘What an insightful thought’, which might make people like it more than the competition because of the pandering.

While Llama 3 is singular in terms of size-to-quality ratio, calling the 8B model close to GPT-4 would be an overstretch.


YetAnotherNick 793 days ago

Yes, I don't know how people don't realize how well cheap tricks work in Chatbot Arena. A single base model produces hundreds of Elo points of difference depending on the way it is tuned. And in most cases, heavy instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see the base model scoring better on MMLU/ARC most of the time on the Hugging Face leaderboard.

Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.


imtringued 793 days ago

When I tried Phi2 it was just bad. I don't know where you got this fantasy that people accept obviously wrong answers because of "pandering".


YetAnotherNick 793 days ago

Obviously a correct answer matters more, but ~100-200 Elo points could be gained just from better writing. Answer correctness would account for a range of about 500 Elo in comparison.
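For a rough sense of what those gaps mean in head-to-head votes, here is a minimal sketch using the standard Elo logistic curve. It assumes the conventional 400-point scale and ignores ties; the function name `expected_score` is made up for illustration, and the leaderboard's actual fitting procedure may differ.

```python
def expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo curve.

    rating_gap is (higher rating - lower rating); scale=400 is the usual
    Elo convention and is an assumption here, not something the thread states.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    # Gaps mentioned in the comment: 100-200 for writing style, ~500 for correctness.
    for gap in (100, 200, 500):
        print(f"{gap:>3} Elo gap -> expected win rate {expected_score(gap):.2f}")
```

Under these assumptions, a 100-200 point gap corresponds to winning roughly 64-76% of pairwise comparisons, while a 500 point gap corresponds to about 95%.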


rgbrgb 793 days ago

> just for better writing

in my use cases, better writing makes a better answer
