Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release.

And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.

Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)

So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.

(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)

Phi-3-mini 3.8b: 71.2
Phi-3-small 7b: 74.9
Phi-3-medium 14b: 78.2
Phi-2 2.7b: 58.8
Mistral 7b: 61.0
Gemma 7b: 62.0
Llama-3-In 8b: 68.0
Mixtral 8x7b: 69.9
GPT-3.5 1106: 75.3

(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)
> Incredible, beat Llama 3 8B with 3.8B parameters after less than a week of release.
Judging by a single benchmark? Without even trying it out in real-world usage?
> And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.
Any potential caveats about such a leaderboard notwithstanding, on that leaderboard alone there is a huge gap between Llama 3 8B and Mistral-Large, let alone any of the GPT-4 models.
By the way, for beating benchmarks: "Pretraining on the Test Set Is All You Need".