Hacker News new | ask | show | jobs
by codeisawesome 867 days ago
Is there a tutorial on how to get that setup running step-by-step? I only found a GitHub issue (https://github.com/ggerganov/llama.cpp/issues/4439) that mentions that mainline llama.cpp isn't working for the model.

Bonus question if you have the time: there's a release by TheBloke for this on HuggingFace (TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF); but I thought his models were "quantised" usually - does that kneecap any of the performance?

4 comments

If you're new to this then just download an app like LMStudio (which unfortunately is closed source, but it is free) which basically just uses llama.cpp under the hood. It's simple enough to get started with local LLMs. If you want something open source ollama is probably a good place to look too, it's just a CLI tool but several GUIs integrate with ollama specifically.

As for your bonus question, that is the model you want. In general I'd choose the largest quantized version that you can fit based on your system. I'm personally running the 8bit version on my M3 Max MacBook Pro and it runs great! Performance is unfortunately a loaded word when it comes to LLMs because it can mean tokens per second or it can mean perplexity (i.e. how well the LLM responds). In terms of tokens per second, quantized models usually run a little faster because memory bandwidth is a constraint, so you're moving less memory around. In terms of perplexity there are different quantization strategies that work better and worse. I really don't think there's much of a reason for anyone to use a full 16fp model for inference, you're not really gaining much there. I think most people use the 4bit quants because it's a nice balance. But really it's just a matter of playing with the models and seeing how well it works. For example, some models perform okay when quantized down to 2 bits (I'm shocked that's the case, but I've heard people say that's the case in their testing), but Mixtral is not one of those models.

Thank you so much for the detailed answer! I didn’t realize Ollama was OSS, I confused it with LMStudio’s licensing. I’ll try it out.

I would say I care a lot more about the perplexity performance than pure T(okens)PS… it’s good to be able to verbalize that.

I'm working on a blog post documenting what I've been doing as a newcomer to llama.cpp and the mixtral model. The steps can apply to any model really. Its mostly about optimization steps I'm experimenting with. Be warned its all new to me and my explanations may not be entirely accurate yet, as I'm still learning the lingo so to speak.

The blog is at https://geuis.com. I'll try to wrap it up today or tomorrow and get the post out.

Check out ollama: https://ollama.ai/

It's easy to get running and doesn't require you to manually download models.

Ollama is great, and they just added (are still adding) OpenAPI API compatible endpoints, thus opening up access to many other toolchain possibilities than previously available to it. It also has some support for some multi-modal (vision and text) models. Easy to use, easy to install, does the job it's designed to do (rather well, even)... Highly recommended!
There's walkthroughs on reddit.com/r/localllama. You can download one click installers for oobabooga, then it's just a matter of getting the model you want and making sure the config is correct.