
Comparing this to 70b doesn't make sense: this is a 12b model, which should easily fit on consumer GPUs. A 70b will have to be quantized to near-braindead to fit on a consumer GPU; 4bit is about as small as you can go without serious degradation, and 70b quantized to 4bit is still ~35GB before accounting for context space. Even a 4090, with its 24GB of VRAM, can't run a 70b.
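For anyone who wants to check the arithmetic, here's a rough sketch. It only counts the weights (parameters times bits per weight), so context/KV cache and runtime overhead come on top. The `weight_gb` helper and the 24GB budget are my own illustration, not anything from Mistral or Meta.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumption: a single 24GB consumer card (e.g. a 4090).
VRAM_GB = 24

for name, params in [("70b", 70), ("12b", 12)]:
    for bits in (16, 8, 4):
        size = weight_gb(params, bits)
        # Strict comparison: weights that exactly fill VRAM leave no room for context.
        verdict = "fits" if size < VRAM_GB else "does not fit"
        print(f"{name} @ {bits}bit: ~{size:.0f}GB of weights, {verdict} in {VRAM_GB}GB")
```

A 70b stays over 24GB even at 4bit, while a 12b is already well under it at 8bit.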

Supposedly Mistral NeMo is better than Llama-3-8b, which is the more apt comparison, although benchmarks usually don't tell the full story; we'll see how it does on the LMSYS Chatbot Arena leaderboards. The other (huge) advantage of Mistral NeMo over Llama-3-8b is the massive context window: 128k (and supposedly 1M with RoPE scaling, according to their HF repo), vs 8k.

Also, this was trained with 8bit quantization awareness, so it should handle quantization better than the Llama 3 series in general, which should let more people run it locally. You don't need a 4090.