by hehdhdjehehegwv 781 days ago
Funny thing is I’m still in love with Mistral 7B as it absolutely shreds on a nice GPU. For simple tasks it’s totally sufficient.

Llama3 8B is for all intents and purposes just as fast.
Mistral 7B runs inference about 18% faster for me as a 4-bit quantized version on an A100. That's definitely relevant when running anything but chatbots.
Are you measuring tokens/sec or words per second?

The difference matters because, in my experience, Llama 3, by virtue of its giant vocabulary, tokenizes text with 20-25% fewer tokens than something like Mistral. So even if it's 18% slower in terms of tokens/second, it may, depending on the text content, actually output a given body of text faster.
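
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted in this thread (18% faster tokens/sec for Mistral, 20-25% fewer tokens for Llama 3) and is not a benchmark. The function name and the normalization (Mistral's token count and token rate both set to 1.0) are my own choices for illustration.

```python
def llama_time_relative_to_mistral(fewer_tokens: float, mistral_speedup: float) -> float:
    """Ratio of wall-clock time (Llama 3 / Mistral) to emit the same text.

    fewer_tokens: fraction of tokens Llama 3 saves on the text (0.20 = 20% fewer).
    mistral_speedup: how much faster Mistral is in tokens/sec (0.18 = 18% faster).
    Both inputs are assumptions taken from the thread, not measurements.
    """
    # Normalize Mistral to 1.0 tokens of text at 1.0 tokens/sec.
    llama_tokens = 1.0 - fewer_tokens
    llama_rate = 1.0 / (1.0 + mistral_speedup)
    # time = tokens / rate; Mistral's time is 1.0 under this normalization.
    return llama_tokens / llama_rate


if __name__ == "__main__":
    for saved in (0.20, 0.25):
        ratio = llama_time_relative_to_mistral(saved, 0.18)
        print(f"{saved:.0%} fewer tokens: Llama 3 takes {ratio:.3f}x Mistral's time")
```

A ratio below 1.0 means Llama 3 finishes the same text sooner. Under these assumptions the break-even point is 1 - 1/1.18, or roughly 15% fewer tokens, so the 20-25% range would put Llama 3 slightly ahead. That only holds if the token savings actually show up on your particular text.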