| HN Mirror

by oersted 793 days ago

Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it. For one, these benchmarks are not so hard to game. We'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2

Phi-3-small 7b: 74.9

Phi-3-medium 14b: 78.2

Phi-2 2.7b: 58.8

Mistral 7b: 61.0

Gemma 7b: 62.0

Llama-3-In 8b: 68.0

Mixtral 8x7b: 69.9

GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)

10 comments

jxy 793 days ago

This inductive logic is way overblown.

> Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.

Judging by a single benchmark? Without even trying it out with real world usage?

> And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Any potential caveats in such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 versions.

By the way, for beating benchmarks: "Pretraining on the Test Set Is All You Need".

oersted 793 days ago

It's easy to miss: select English in the dropdown. The scores are quite different in Overall and in English for LMSYS.

As I've stated in other comments, yeah... Agreed, I'm stretching it a bit. It's just that any indication of a 3.8B model being in the vicinity of GPT-4 is huge.

I'm sure that when things are properly measured by third-parties it will show a more sober picture. But still, with good fine-tunes, we'll probably get close.

It's a very significant demonstration of what could be possible soon.

saretup 793 days ago

Firstly, English is a highly subjective category.

Secondly, Llama 3 usually adds first sentences like ‘What a unique question!’ or ‘What an insightful thought’, which might make people like it more than the competition because of the pandering.

While Llama 3 is singular in terms of size to quality ratio, calling the 8B model close to GPT4 would be an overstretch.

YetAnotherNick 793 days ago

Yes, I don't know how people don't realize how well cheap tricks work in Chatbot Arena. A single base model can show hundreds of Elo points of difference depending on the way it is tuned. And in most cases, heavy instruction tuning even slightly decreases reasoning ability on standard benchmarks. You can see the base model scoring better on MMLU/ARC most of the time on the Hugging Face leaderboard.

Even GPT-4-1106 seems to only sound better than GPT-4-0613 and work for a wider range of prompts. But with a well-defined prompt and follow-up questions, I don't think there is an improvement in reasoning.

imtringued 793 days ago

When I tried Phi2 it was just bad. I don't know where you got this fantasy from that people accept obviously wrong answers, because of "pandering".

YetAnotherNick 793 days ago

Obviously a correct answer matters more, but ~100-200 Elo points could be gained just from better writing. Answer correctness would span a range of about 500 Elo in comparison.

ignoramous 793 days ago

> Phi-3-mini 3.8b: 71.2

Per the paper, phi3-mini (which is English-only) quantised to 4-bit uses 1.8 GB of RAM and outputs 1212 tokens/sec (correction: 12 tokens/sec) on iOS.

A model on par with GPT-3.5 running on phones!

(weights haven't been released, though)
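
As a rough sanity check on the memory figure quoted above, here is a minimal back-of-the-envelope sketch. It assumes the footprint is dominated by the weights themselves and ignores activations, the KV cache, and any per-block quantization metadata, so it is an estimate and not the paper's measurement. The function name is made up for illustration.

```python
def quantized_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed to store the weights alone at the given precision.

    Assumption: no overhead for quantization scales, activations, or KV cache.
    """
    return n_params * bits_per_weight / 8


N_PARAMS = 3.8e9  # Phi-3-mini parameter count quoted in the thread

for bits in (16, 8, 4):
    size = quantized_weight_bytes(N_PARAMS, bits)
    # Report both decimal gigabytes and binary gibibytes, since the two
    # conventions differ by about 7% at this scale.
    print(f"{bits:>2}-bit: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

At 4 bits per weight this comes to about 1.9 GB (roughly 1.77 GiB), which is in the same ballpark as the 1.8 GB reported.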

coder543 793 days ago

> (weights haven't been released, though)

Phi-1, Phi-1.5, and Phi-2 have all had their weights released, and those weights are available under the MIT License.

Hopefully Microsoft will continue that trend with Phi-3.

> outputs 1212 tokens/sec on iOS

I think you meant "12 tokens/sec", which is still nice, just a little less exciting than a kilotoken/sec.

jph00 793 days ago

Weights will be released tomorrow, according to one of the tech report authors on Twitter.

ignoramous 793 days ago

> you meant 12 tokens/sec

Thanks! The HTML version on archive.is has messed up markup and shows 1212 instead: https://archive.is/Ndox6

intellectronica 793 days ago

Weights are coming tomorrow.

homarp 793 days ago

tomorrow is now: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

karmasimida 793 days ago

Where did you get this from?

> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones

No, not even close ... Even Gemini has a huge UX gap compared to GPT4/Opus; for an 8B model I won't even attempt this argument.

alecco 793 days ago

At a glance, it looks like Phi-3 was trained on an English only, STEM-strong dataset. See how they are not as strong in HumanEval, Trivia, etc. But of course it's very good.

crakenzak 793 days ago

Can’t wait to see some Phi-3 fine tunes! Will be testing this out locally, such a small model that I can run it without quantization.

Feels incredible to be living in a time with such breakneck innovation. What are the chances we’ll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

nl 793 days ago

> What are the chances we’ll have a <100B parameter GPT4/Claude Opus model in the next 5 years?

In 5 years time we'll have adaptive compute and the idea of talking about the parameter count of a model will seem as quaint as talking about the cylinder capacity of a jet engine.

regularfry 793 days ago

It feels like it's going to be closer than that. People always forget that GPT4 and Opus have the advantage of behind-the-curtain tool use that you just can't see, so you don't know how much of a knowledge or reasoning leg-up they're getting from their internal tooling ecosystem. They're not really directly comparable to a raw LLM downloaded from HF.

What we need is a standardised open harness for open source LLMs to sit in that gives them both access to tools and the ability to write their own, and that's (comparatively speaking) a much easier job than training up another raw frontier LLM: it's just code, and they can write a lot of it.

Deverauxi 793 days ago

5 years? 5 years is a millennium these days.

We’ll have small local models beating GPT-4/Claude Opus in 2024. We already have sub-100B models trading blows with former GPT-4 models, and the future is racing toward us. All these little breakthroughs are piling up.

refulgentis 793 days ago

Absolutely not on the first one. Not even close.

ashirviskas 793 days ago

Why not? There's still 7 months left for breakthroughs.

refulgentis 793 days ago

Small leaves wiggle room, but it's extremely unlikely trad small, <= 7B, will get there this year even on these evals.

UX matching is a whole different matter and needs a lot of work: I've worked heavily with Llama 8B over the last few days, and Phi 3 today, and the Q+A benchmarks don't tell the full story. E.g., it's nigh impossible to get Llama _70_B to answer in JSON; when Phi sees RAG from search, it goes off inventing new RAG material and a new question.
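
One common workaround when a model wraps its JSON in chatty prose is to extract the first parseable object from the reply instead of expecting a clean response. The sketch below is only an illustration of that idea using Python's standard `json` module; the helper name and the sample reply are hypothetical, and it does nothing to make the model itself better at following the format.

```python
import json
from typing import Any, Optional


def extract_first_json_object(reply: str) -> Optional[Any]:
    """Return the first JSON object embedded in a model reply, or None.

    Tries each '{' in turn as a possible start of a JSON object, so leading
    pleasantries and trailing commentary are ignored.
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates extra text after it.
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            start = reply.find("{", start + 1)
    return None


# Hypothetical reply of the kind described above.
reply = (
    "What a unique question! Here is the JSON you asked for:\n"
    '{"answer": "Paris", "confidence": 0.9}\n'
    "Hope that helps!"
)
print(extract_first_json_object(reply))
```

If no object can be parsed the function returns `None`, which the caller can treat as a signal to retry the request.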

bugglebeetle 793 days ago

We already do. It’s called Llama 3 70B Instruct.

vitorgrs 793 days ago

Llama 3 is awful in non-English. 95% of their training data is in English....

GPT is still the king when talking about multiple languages/knowledge.

stavros 793 days ago

Is it released?

viraptor 793 days ago

On par in some categories. Phi was intended for reasoning, not storing data, due to small size. I mean, it's still great, but the smaller it gets, the more facts from outside of the prompt's context will not be known at all.

candiodari 793 days ago

I wonder if that's a positive or negative. How does it affect hallucinations?

viraptor 793 days ago

It depends what you want to do. If you want a chat bot that can replace most Google queries, you want as much learned data as possible and the whole Wikipedia consumed. If you want a RAG style system, you want good reasoning about the context and minimal-or-no references to extra information. It's neither positive nor negative without a specific use case.

blackeyeblitzar 793 days ago

It’s not open source, but is open weight - like distributing a precompiled executable. In particular what makes it open weights rather than just weights available is that it is licensed using an OSI approved license (MIT) rather than a restricted proprietary license.

I really wish these companies would release the training source, evaluation suites, and code used to curate/filter training data (since safety efforts can lead to biases). Ideally they would also share the training data but that may not be fully possible due to licensing.

moralestapia 793 days ago

>And on LMSYS English, Llama 3 8B is well above GPT-4

Source?

oersted 793 days ago

Right thanks for the reminder, I added it

moralestapia 793 days ago

Thanks, I don't see them being "well above GPT-4", merely 1 point? Also, no idea why one would want to exclude GPT-4-Turbo, the flagship "GPT-4" model, but w/e.

I also don't think they "beat Llama 3 8B"; their own abstract says "rivals that of models such as Mixtral 8x7B and GPT-3.5", "rivals" not even "beats".

Great model, but let's not overplay it.

oersted 793 days ago

In the English category: GPT-4-0314 (ELO 1166), Llama 3 8B Instruct (ELO 1161), Mistral-Large-2402 (ELO 1151), GPT-4-0613 (ELO 1148).

You are right, I toned down the language, I got a bit overexcited, and I missed the difference in the versions of GPT-4. And LMSYS is a subjective benchmark for what users prefer, which I'm sure has weird inherent biases.

It's just that any signal of a 3.8B model being anywhere in the vicinity of GPT-4 is huge.
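
To put the English-category ratings quoted above in perspective, the sketch below converts rating gaps into expected head-to-head win rates. It assumes the standard Elo logistic formula with a 400-point scale; the leaderboard's actual rating procedure may differ, so treat the output as an approximation. The function name is made up for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# English-category ratings as quoted in the comment above.
ratings = {
    "GPT-4-0314": 1166,
    "Llama 3 8B Instruct": 1161,
    "Mistral-Large-2402": 1151,
    "GPT-4-0613": 1148,
}

baseline = "Llama 3 8B Instruct"
for name, rating in ratings.items():
    if name == baseline:
        continue
    p = elo_expected_score(ratings[baseline], rating)
    print(f"{baseline} vs {name}: expected win rate {p:.1%}")
```

Under this model, gaps of 5 to 13 points correspond to win rates within about two percentage points of 50%, so these four models are close to a coin flip against each other in that category. A 200-point gap, by comparison, would mean roughly a 76% win rate for the stronger model.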

moralestapia 793 days ago

Yeah, GPT3.5, in a phone, at ~1,000 tokens/sec ... nice!

mlyle 793 days ago

> at ~1,000 tokens/sec

12 tokens per second.

zone411 793 days ago

> So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones?

No, we don't. LMsys is just one, very flawed benchmark.

ukuina 793 days ago

Why is LMsys flawed?

Many people treat LMsys as gospel because it's the only large-scale, up-to-date qualitative benchmark. All the numeric benchmarks seem to miss real-world applicability.

oersted 793 days ago

Agreed, but it's wild that even one benchmark shows this. Based on what we knew just a few months ago, these models should be so far from each other in every benchmark.

infecto 793 days ago

"But still"? Lets be realistic, all of these benchmark scores are absolute garbage. Yes, the open source community is making great strides, they are getting closer but the gap is still wide when comparing to commercially available models.
