|
|
|
|
|
by codeisawesome
867 days ago
|
|
Is there a tutorial on how to get that setup running step-by-step? I only found a GitHub issue (https://github.com/ggerganov/llama.cpp/issues/4439) that mentions that mainline llama.cpp isn't working for the model. Bonus question if you have the time: there's a release by TheBloke for this on HuggingFace (TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF); but I thought his models were "quantised" usually - does that kneecap any of the performance? |
|
As for your bonus question, that is the model you want. In general I'd choose the largest quantized version that you can fit based on your system. I'm personally running the 8bit version on my M3 Max MacBook Pro and it runs great! Performance is unfortunately a loaded word when it comes to LLMs because it can mean tokens per second or it can mean perplexity (i.e. how well the LLM responds). In terms of tokens per second, quantized models usually run a little faster because memory bandwidth is a constraint, so you're moving less memory around. In terms of perplexity there are different quantization strategies that work better and worse. I really don't think there's much of a reason for anyone to use a full 16fp model for inference, you're not really gaining much there. I think most people use the 4bit quants because it's a nice balance. But really it's just a matter of playing with the models and seeing how well it works. For example, some models perform okay when quantized down to 2 bits (I'm shocked that's the case, but I've heard people say that's the case in their testing), but Mixtral is not one of those models.