Hacker News
by rhdunn 876 days ago
If you want to run Mixtral 8x7B locally you can use llama.cpp (including with any of the supporting libraries/interfaces such as text-generation-webui) with https://huggingface.co/TheBloke/Nous-Hermes-2-Mixtral-8x7B-S....

The smallest quantized version (2bit) needs 20GB of RAM (which can be offloaded onto the VRAM of a decent 4090 GPU). The 4bit quantized versions are the largest models that can just about fit onto a 32GB system (29GB-31GB). The 6bit (41GB) and 8bit (52GB) models need a 64GB system. You would need multiple GPUs with shared memory if you wanted to offload the higher precision models to VRAM.
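
As a rough illustration of the sizing above, here is a minimal Python sketch that picks the highest-precision quantization that fits a given amount of RAM. The figures are the approximate ones quoted in this comment (using the upper 31GB bound for 4bit); the function name and the `headroom_gb` parameter are hypothetical and only there for illustration.

```python
# Approximate RAM needed per quantization level, in GB, as quoted in the
# comment above (the upper bound of the 29GB-31GB range is used for 4bit).
QUANT_RAM_GB = {"2bit": 20, "4bit": 31, "6bit": 41, "8bit": 52}


def largest_quant_that_fits(system_ram_gb, headroom_gb=0):
    """Return the highest-precision quant whose RAM need fits, or None."""
    budget = system_ram_gb - headroom_gb
    fitting = [(need, name) for name, need in QUANT_RAM_GB.items() if need <= budget]
    if not fitting:
        return None
    # The entry with the largest RAM need is also the highest precision here.
    return max(fitting)[1]


for ram in (16, 32, 64):
    print(ram, "GB ->", largest_quant_that_fits(ram))
```

Setting `headroom_gb` to a few GB leaves room for the operating system and context, which is why the 4bit models only "just about" fit on a 32GB machine.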

I've experimented with the 7B and 13B models, but haven't experimented with these models yet, nor other larger models.


And if you want better performance when talking about code, you can try the dolphin-mixtral fine-tune: https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGU...
> You would need multiple GPUs with shared memory if you wanted to offload the higher precision models to VRAM.

Or just a powerful Apple Silicon machine? I've tried dolphin mixtral 4bit on an M3 MacBook with 36GB of RAM, and inference is super fast.

Or a Linux machine with a Ryzen using the internal GPU and the unified RAM (scroll down at llama.cpp and look for ROCm).
Wait, ROCm supports Ryzen APUs and still doesn't support dedicated GPUs like the 6700XT?!
While it's not officially supported, ROCm runs just fine on my 6700XT. I just have to set an env var (export HSA_OVERRIDE_GFX_VERSION=10.3.0).
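If you launch your tools from a script rather than a shell, the same override can be passed per process. A minimal Python sketch, assuming only the variable name and value from the comment above; the helper names are hypothetical, and whether the override works for a given card is not something this code checks.

```python
import os
import subprocess


def rocm_env(gfx_version="10.3.0", base_env=None):
    """Copy an environment and add the HSA override quoted above."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_version
    return env


def run_with_override(command, gfx_version="10.3.0"):
    """Run `command` (a list of arguments) with the override set."""
    return subprocess.run(command, env=rocm_env(gfx_version), check=True)
```

For example, `run_with_override(["rocminfo"])` would run ROCm's info tool with the override, assuming it is installed and on your PATH.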
Really? Does everything run? Even AI stuff? Do you have any links where I can read more about that?
Or a Jetson AGX Orin (~$2k). Probably the cheapest way to get an Nvidia GPU with 64 GB of RAM.
I wonder what would be the cheapest way to run an LLM: the latest Ryzen integrated graphics with 64GB of RAM, or the Jetson AGX Orin 64GB? https://www.nvidia.com/en-us/autonomous-machines/embedded-sy...
The Ryzen is a lot cheaper, but most likely also a fair bit slower. You'd be looking at a $200 CPU, a $200 motherboard and $200 of DDR5 RAM. Throw in a case, NVMe drive and power supply and you're still below $1k, and those numbers are quite generous estimates. You could do it a lot cheaper by going AM4 with DDR4 RAM.
Have you tried this yourself? Curious to know how well this works for an LLM home lab.
I’ve worked with Jetson going back to the TK1 and I highly recommend you do not do this.

Nvidia has significant dominance in the AI space because of their work on software and the overall platform.

With the Jetson line being the sole exception. Use it for what it’s for - a targeted build for an embedded/specific application requiring small size and low power.

The software is a mess. Support for Jetson (generally) is a far afterthought or not considered at all around projects at Nvidia and the broader ecosystem. When it is supported at all it lags behind significantly, using ancient distros (Jetpack), etc. To make matters worse the user base is so (relatively) tiny there are bugs and strange behavior everywhere.

Just don’t do it.

This is a bit surprising to hear. The current JetPack 6 is based on Ubuntu 22.04, which is the current Ubuntu LTS release. There's nothing ancient about it, no? I'm pretty sure that if I checked the versions of CUDA, PyTorch and TensorFlow, they'd also be relatively recent.

I'd suggest checking what examples are available, seeing what the community is doing, and seeing whether what you need has already been tried: https://www.jetson-ai-lab.com

From what I've seen, mainstream LLM libraries like vLLM and llama.cpp that use CUDA under the hood tend to work out of the box. And there are tutorials available: https://www.jetson-ai-lab.com/tutorial_text-generation.html. I think that TensorFlow and PyTorch are also well maintained, although I've not checked recently.

According to this article [1], it looks like no complex preparation is needed to run inference on a Jetson system. It should work with Mixtral too.

[1] https://www.hackster.io/pjdecarlo/llama-2-llms-w-nvidia-jets...

I haven't tried it for LLMs yet (I use it for real-time RF processing), but I actually have one of them on my desk and they are fun little devices.

Maybe I will try to get a 32 GB+ LLM running one of these days.

What? I can do this? Runs to the PC

EDIT: I cannot, I need to install ROCm to compile with it, and then install something called hipBLAS, and who knows what else.

Well, yes, you need to install ROCm and its dependencies. Have a look at https://rocm.docs.amd.com/projects/install-on-linux/en/lates... Debian trixie (not yet released) has most dependencies as packages. Or you can try a Docker container: https://rocm.docs.amd.com/projects/install-on-linux/en/lates...
I'll try that, thanks!
OpenCL should also work on AMD cards, and is way easier to install
It is dead slow on integrated graphics, unfortunately.
Does that let me use unified memory on the GPU, though? Or is it just so I can use my CPU memory?

EDIT: Oh, no, I have an nVidia GPU, AMD CPU.

I bet your AMD CPU has an internal GPU, too. That's what you can use with the unified memory.
How much RAM are you able to set aside for a ryzen igpu?
I think my motherboard allows me to dedicate 12GB. I didn't see any improvement using CPU + ROCm compared to CPU alone. Using CPU alone I get 4.2-5 tokens/s; with ROCm I get 4.5-5.2 tokens/s. With CPU + RTX 2070 8GB I get 6.2-7 tokens/s.
How fast is it with a setup like this?
I can run 4bit on a beat up 1070 ti. GP talks about higher precision models
You wouldn’t be able to fit the whole model into 8GB VRAM. It’s faster than not using a GPU at all, but most of it would still be computed on the CPU.
IME ollama ran mixtral on a 1070 fast enough.
Though it most probably does not run on the 1070 but rather on the CPU. It cannot fit on a 1070. It is not about speed: a 1070 cannot run it, period.
In llama.cpp you can offload some of the layers to the GPU with -ngl X, where X is the number of layers.
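To pick a starting value for X, a back-of-the-envelope estimate can help. A minimal Python sketch, assuming all layers are the same size (they are not exactly) and that some VRAM must stay free for context and buffers; the function names, the `reserve_gb` default and the `./main` binary path are illustrative assumptions, while `-m` and `-ngl` are llama.cpp's model and GPU-layer flags.

```python
def layers_that_fit(model_size_gb, total_layers, vram_gb, reserve_gb=1.0):
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb // per_layer_gb))


def llama_cpp_args(binary, model_path, gpu_layers):
    """Build an argument list using llama.cpp's -m and -ngl flags."""
    return [binary, "-m", model_path, "-ngl", str(gpu_layers)]


# Illustrative numbers: a ~31GB 4bit model split into 33 layers, on an 8GB card.
n = layers_that_fit(31, 33, 8)
print(llama_cpp_args("./main", "model.gguf", n))
```

Treat the result as a first guess and adjust it while watching actual VRAM usage.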
Did you do anything special to make that work? Is it useful? Or just a toy?
I have a 14" MBP with an M1 Max and 64GB. The M3 won't really make a difference, but the RAM, since unified, is huge. I can run most models on this machine with realtime performance compared to a Ryzen 7735HS and 64GB (DDR5). Now I'm not saying the Ryzen setup should be good, but the M1 architecture just makes it a much better option. I could add an eGPU to the Ryzen system and it could likely do better, but would also exceed the price point and portability.
it's not just that it's huge and unified - ryzen APUs obviously can have 2x32GB SODIMMs put in them and they support unified memory too.

the difference is the bandwidth and the computational power of the APU. M1 Max is roughly similar to a PS5 in terms of overall system design (shader configuration and bandwidth) plus has dedicated AI inference units already (which won't be added to consoles until PS5 Pro launches with RDNA 3.5). It is far more bandwidth than you can get out of a socketed-memory laptop system.

https://twitter.com/Locuza_/status/1450271726827413508

To support that level of performance in a socketed-memory system you will need an extra layer of caching added to the processor to supplement the bandwidth - and maybe still need to go to quad-channel. Those products are Strix and Strix Halo and should be hitting the market over the next year or two but the reality is that the M1 Max was an absurdly powerful laptop, far more potent than even the first-gen 5nm laptops for x86 let alone the other junk you could buy in 2020.

This is the problem with the discourse around apple silicon for the last few years: yeah, they're expensive, but even a loaded-out x86 laptop doesn't get you the same capabilities. Even if the x86 is competitive in some particular benchmark on iso-node you are probably spending more power to do it, and the x86 product comes years after the apple product, and still has a much weaker gpu and less bandwidth (which doesn't just matter for GPU, it matters for compiling and JIT too).

It is incredibly silly to look back on the discourse in 2020-2023 around apple silicon: a lot of reviewers made extremely silly claims about how "even 7nm x86 processors were already competitive with apple silicon", and as the ecosystems have matured it is obvious that even 5nm processors are not quite competitive yet. And they dumped on the SPEC tests and Geekbench that measured this properly, in favor of dumb things like cinebench R23 and so on (it's always cinebench used for this dumb shit tbh, CB R13/R15 were hugely misleading at the zen1 launch too). Let alone things like, you know, compiling or JVM/node workloads...

(similarly, gotta love the vibe a few years ago of: "threadripper vs mac pro" - did you know that a 64C threadripper with 256GB RAM is actually cheaper than a mac pro loaded out with 2TB!? waow, who knew systems with an order of magnitude less capacity would be cheaper!? https://youtu.be/BH291DQRIOg )

I've had less luck with Mixtral, but I run Yi 34B finetunes for general personal use, including quick queries for work.

It's kinda like GPT 3.5, with no internet access and slightly less reliable responses, but unrestrained, much faster and with a huge (up to 75K on my Nvidia 3090) usable context.

Mixtral is extremely fast though, at least at a batch size of 1.

Which Yi 34B finetunes are you using that have a 75,000 token length?
All of the Yi 200K finetunes should support it, but you have to be careful because some degrade the base model's quite excellent long context performance more than others. The very strong Bagel 34B DPO model, for instance, basically doesn't work at long context.

Nous Capybara is a popular one. I personally use my own merge of many models, and you can look through the constituent models to see if any interest you: https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megame...

You can't really use llama.cpp for super long context btw, it's just too slow and VRAM-inefficient at the moment.

Nothing special other than llama.cpp, which is an inference engine optimized for apple silicon.

I heard you can simply install the ollama app, which uses llama.cpp under the hood but has a more user-friendly experience.

I've been using it for 'easy' queries like syntax/parameter questions, in place of ChatGPT 4. It's great for that. I am using a ~48GB version.
2bit is pretty damn terrible, I don't recommend it for anything serious.
At that level of quantization / distillation, smaller models like phi-2 (q&a) and wavecoder-6.7b (code-gen) might be preferable over QLoRAd ones: https://huggingface.co/microsoft/phi-2

> 2bit is pretty damn terrible

Wait till you go hybrid [0] or even 1bit [1]

[0] https://github.com/efeslab/Atom

[1] https://github.com/IST-DASLab/qmoe

I prefer koboldcpp over llama.cpp. It’s easy to split between GPU and CPU on models larger than VRAM.
Llama.cpp has --n-gpu-layers that lets you set how much of the model to put on the GPU.
Runs in Oobabooga textUi as well, if you add the llama.cpp extension. Easier interface imo, plus fun stuff like coqui and whisper integration.
That's interesting. It also looks like koboldcpp works better with long interactions, as it only processes changed tokens. I'm using llama.cpp with text-generation-webui and its OpenAI compatible API. I'll have to look to see if I can use koboldcpp with it.
Llama.cpp has an interactive mode, but I don't think text-generation-webui uses it. https://github.com/ggerganov/llama.cpp/blob/master/examples/...
Indeed. Koboldcpp works fine with other UIs than the bundled one.
I've got an aging 2080Ti and Ryzen 3800X with 96GB RAM, any point in trying to mess with the GPU or?

Haven't really been able to justify upgrading to a 4090 or similar given I play so few new games these days.

Yes, offloading some layers to the GPU and VRAM should still help. And 11gb isn't bad.

If you're on linux or wsl2, I would run oobabooga with --verbose. Load a GGUF, start with a small number of GPU layers and creep up, keeping an eye on VRAM usage.

If you're on windows, you can try out LM Studio and fiddle with layers while you monitor VRAM usage, though windows may be doing some weird stuff sharing ram.
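
The "start small and creep up" procedure can be sketched as a simple loop. This is only an outline in Python: `vram_used_mb` is a hypothetical callback that you would have to implement yourself (for example by reloading the model with the given layer count and reading the VRAM figure from your monitoring tool), and the `start` and `step` defaults are arbitrary.

```python
def creep_up_gpu_layers(total_layers, vram_used_mb, vram_limit_mb, start=1, step=2):
    """Raise the GPU layer count until the next try would exceed the VRAM limit.

    vram_used_mb is a caller-supplied (hypothetical) function: given a layer
    count, it loads the model with that many layers offloaded and returns the
    VRAM in use, in MiB.
    """
    best = 0
    layers = start
    while layers <= total_layers:
        if vram_used_mb(layers) > vram_limit_mb:
            break  # this count does not fit; keep the last one that did
        best = layers
        layers += step
    return best


# Stand-in probe for demonstration only: 500 MiB overhead plus 350 MiB per layer.
print(creep_up_gpu_layers(33, lambda n: 500 + 350 * n, vram_limit_mb=8192))
```

Leaving a margin below the card's full capacity in `vram_limit_mb` gives the context and other applications some room.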

Would be curious to see the diffs. Specifically if there's a complexity tax in offloading that makes the CPU-alone faster but in my experience with a 3060 and a mobile 3080, offloading what I can makes a big diff.

> Specifically if there's a complexity tax in offloading that makes the CPU-alone faster

Anecdotal, but I played with a bunch of models recently on a machine with a 16GB AMD GPU and 64GB of system memory/12 core CPU. I found offloading to significantly speed things up when dealing with large models, but there was seemingly an inflection point as I tested models that approached the limits of the system, where offloading did seem to significantly slow things down vs just running on the CPU.

I had only cuda installed and it took 2 ollama shell commands in WSL2 from quite literally 0 local LLM experience to running mixtral fast enough on a 1070 and 12700k. Go for it.
kobold bundles and runs llama.cpp, so it should be much the same, just with convenient defaults.
When talking about memory requirements one also needs to mention the sequence length. In case of Mixtral, which supports 32000 tokens, this can be a significant chunk of the memory used.
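To put a rough number on that, the key/value cache grows linearly with the sequence length. A minimal Python sketch of the usual estimate; the default architecture values (32 layers, 8 key/value heads, head dimension 128) are my reading of the published Mixtral 8x7B configuration and the 2 bytes per value assumes a 16-bit cache, so treat all of them as assumptions. Actual usage depends on the inference engine.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer and position."""
    # Two tensors (keys and values) per layer, each seq_len x n_kv_heads x head_dim.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


for tokens in (4096, 32768):
    print(tokens, "tokens ->", kv_cache_bytes(tokens) / 2**30, "GiB")
```

Under these assumptions a full 32768-token context costs several GiB on top of the weights, while a 4096-token context costs a fraction of that.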
`ollama run mixtral:8x7b-instruct-v0.1-q3_K_L` works fast on my 3090 locally
Dumb question, but how can a 32 bit number be converted to 2 bits and still be useful? It seems like magic.
Mixtral and others are often distributed as 16-bit floats, so that chops the problem in half immediately, but then it turns out that LLMs only have about four bits per parameter of actual information stored. There's a lot of redundancy. The ideal quantisation scheme would only throw away useless data, but no quantisation scheme is perfect so they inevitably harm the model somehow.

You've then got to remember that one thing neural networks are very, very good at is being noise tolerant. In some senses that's all they are - noise correction systems. The inaccuracies introduced by quantisation are "just" a sort of noise, so it's not surprising that they aren't fatal. It just raises the noise floor and gives the model more ways to be wrong.

Finally the thing to know is that these quantisation schemes don't do a naive "chop each number down to two bits", not exactly. Simplifying a bit, for each parameter in this example they'd try to find a mapping from a two-bit index into a four element lookup table of higher-precision values such that the information destroyed by replacing the original parameter by the lookup value is minimised. That mapping is calculated across small blocks of parameters, rather than across the entire model, so it can preserve local detail. The lookup table gets stored per block, which throws the compression ratio off a little.
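
A simplified Python sketch of the per-block lookup-table idea described above. It is not llama.cpp's actual scheme: the table here is just an evenly spaced grid between each block's minimum and maximum, rather than a mapping chosen to minimise the information lost, and the function names and the block size of 32 are assumptions for illustration.

```python
def quantize_block(values, bits=2):
    """Map each value to an index into a small per-block lookup table."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1) if hi > lo else 0.0
    # Evenly spaced table between the block's min and max (a simplification).
    table = [lo + i * step for i in range(levels)]
    if step == 0.0:
        indices = [0] * len(values)
    else:
        indices = [min(levels - 1, max(0, round((v - lo) / step))) for v in values]
    return indices, table


def dequantize_block(indices, table):
    """Replace each index with its lookup-table value."""
    return [table[i] for i in indices]


def quantize(weights, block_size=32, bits=2):
    """Quantize a flat list of weights block by block."""
    return [quantize_block(weights[s:s + block_size], bits)
            for s in range(0, len(weights), block_size)]


def dequantize(blocks):
    """Rebuild the approximate weights from all blocks."""
    out = []
    for indices, table in blocks:
        out.extend(dequantize_block(indices, table))
    return out
```

Because each block gets its own table, a block of small weights keeps fine resolution even when another block contains large outliers, at the cost of storing the table alongside the indices.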

Nice graphs here: https://github.com/ggerganov/llama.cpp/pull/1684

So for example, 2 bit version of the 30B is much worse than the original, but still better than the 13B model.

Also, there are lots of extra details, eg, not all of the weights are 2 bit, and even the 2 bit weights are higher than that overall as groups of quantised weights share scale factors stored elsewhere.
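
The point about shared scale factors can be made concrete with a little arithmetic. A minimal Python sketch, using a hypothetical block layout (32 weights sharing one 16-bit scale and one 16-bit minimum) rather than any specific llama.cpp format:

```python
def effective_bits_per_weight(weight_bits, block_size, metadata_bits_per_block):
    """Average storage per weight once per-block metadata is included."""
    total_bits = weight_bits * block_size + metadata_bits_per_block
    return total_bits / block_size


# Hypothetical layout: 32 two-bit weights plus a 16-bit scale and a 16-bit minimum.
print(effective_bits_per_weight(2, 32, 16 + 16))
```

With this layout the nominal 2 bits per weight become 3 bits per weight on disk, which is the kind of overhead the comment above refers to.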

I think of it with this kind of analogy: the original image is stored with a 32-bit color scheme. You can reduce the color scheme to 16-bit accuracy and still figure out pretty well what the image is about. 2-bit is stretching this a bit far: each pixel can only be one of four shades. But even if you lose lots of nuance, in many images even that gives you some idea of what's going on in the image.
That’s an interesting question, I wonder if there is an analogy in quantisation to image dithering?
This blog post might shed some light on the matter. If I'm understanding it correctly, it claims there are emergent features on the LLM weights that make it easier to "compress" the floats into smaller bits without losing much precision.

https://timdettmers.com/2022/08/17/llm-int8-and-emergent-fea...

Note that 2 bit quantization is generally regarded as too aggressive. Generally 4bits+ achieves a good tradeoff, see eg. https://arxiv.org/abs/2212.09720

It's not really 2 bits.

Modern quantization schemes are almost like lossy compression algorithms, and LLMs in particular are very "sparse" and amenable to compression.

All the 32 bits weren't necessarily used, and it's the whole network itself that has to be useful. It's a tradeoff. We started with very good precision to test the new method, now we can optimize some parts of it
Here’s an example of a custom 4 bits/weight codec for ML weights:

https://github.com/Const-me/Cgml/blob/master/Readme.md#bcml1...

llama.cpp does it slightly differently but still, AFAIK their quantized data formats are conceptually similar to my codec.

The extra precision is more useful for training. Once the network is optimized, it's a statistical model and only needs enough precision to make good guesses. In fact, one of the big papers on this also pointed out that you can drop about 40% of the weights completely. I think people generally skip that part because sparse matrix operations are slower, so it doesn’t help here.
For models with dropped weights, the keyword is "distilled". For example ssd-1b is a 50% size version of Stable Diffusion XL (https://huggingface.co/segmind/SSD-1B)
That’s crazy, I’ve never seen one that dropped whole layers from a pre-trained model. I guess that avoids the sparse matrix math.
Faraday.dev has it in its selection of models now. Good for us clueless Windows folks. Runs decently fast with 16gb mobile 3080 gpu. Results seem better than any other free option.
Why not normal RAM? Ryzen 5600 with 128GB DDR4 is perfectly fine to run mixtral 8bit, and costs less than $1000.

GPUs are only needed if you can not wait 5 minutes for an answer, or for training.

Or if you want multiple sessions at the same time. Or if you want to do anything else with your machine while it's running.

But realistically, 5 minutes is too long. It should be conversational, and for that you need at least 5 tokens per second. Which your Ryzen just can't do.

>It should be conversational, and for that you need at least 5 tokens per second.

To be fair, a lot of people are using this for non-interactive work, like batching document analysis or offline processing of user generated content.

This particular thread we are commenting on is about Dolphin Mixtral, which is mostly used for offline code completion (à la Microsoft GitHub Copilot). You don’t want to have to wait 5 minutes at every keystroke to get code suggestions.
In my experience, it takes some experimentation to figure out a good prompt. I don’t think I would have gotten very far if I had to wait that long for each result.
Why not both? Llama.cpp allows layering GGUF models between GPU and CPU memory.
> GPUs are only needed if you can not wait 5 minutes for an answer

Yeah, but that's generally true (or at least, “5 minutes for an answer is very suboptimal”, even if “can’t” isn’t quite true) for interactive use cases, which are... a lot of LLM use cases.

Not sure why you're getting downvoted. It performs decent enough on my Ryzen 3600X with 64GB of RAM. It definitely wouldn't be usable for production or fine-tuning, but it's fine for experimenting.
> perfectly fine

Only for very short context and responses.

Beyond that, the performance is painful.

That was what I was referring to with the 32/64 GB systems.
What's the bandwidth between the Ryzen and that DDR4?
What differences would I measurably notice running the 2-bit version vs the 4-bit version vs the 6-bit vs the 8-bit?
Using TheBloke's nous-hermes-2-mixtral-8x7b-dpo.Q5_K_M.gguf, llama.cpp with oobabooga, on a Ryzen 9 7950X and RTX 2070 Super, offloading 4 layers to the GPU.

27,631MiB system memory used.

7358/8192MiB usage on the GPU.

I'm getting 6.32 tokens/second.

You: You are Hunter S. Thompson, write a 450 words essay on the LHC project. The writing must be in Thompson's style

AI: Well, buckle up, kiddos, because we're about to dive into the twisted world of particle physics and the Large Hadron Collider, or as I like to call it, the Doomsday Machine. This beast of a thing is buried deep beneath the Swiss countryside, a 17-mile-long ring of death and destruction that makes the atom bomb look like a goddamn firecracker.

Now, let me tell you, when the Swiss government first proposed this monstrosity, I thought they'd gone completely batshit crazy. But then again, this is the land of Toblerone and Heidi, so maybe I shouldn't be too surprised. The LHC is essentially a giant, underground racetrack for subatomic particles, where they're accelerated to nearly the speed of light before being smashed into each other like a couple of drunken frat boys at a kegger.

The goal of this madness? To recreate the conditions that existed just after the Big Bang, when the universe was nothing more than a seething cauldron of energy and chaos. And if that doesn't sound like a recipe for disaster, I don't know what does.

But let's not forget the real reason we're all here: the Higgs boson, or as it's more commonly known, the God Particle. This little bugger is what gives other particles their mass, and without it, we'd all be floating around like a bunch of cosmic tumbleweeds. The LHC was designed to find this elusive particle, and in 2012, they finally did it. Sort of.

You see, the Higgs boson they found was a bit of a disappointment. It was too light, too wimpy, and didn't behave quite the way they expected it to. So now, the physicists are back at the drawing board, trying to figure out what the hell went wrong. And in the meantime, the LHC continues to churn out more and more data, like some kind of unholy particle-smashing machine.

But let's not forget the dark side of the LHC. There are those who believe that this thing could very well be the end of the world as we know it. They fear that the LHC could create a black hole, or even a miniature Big Bang, right here on Earth. And while the scientists assure us that these fears are unfounded, I can't help but wonder if they're just whistling past the graveyard.

So, what have we learned from all this? Well, for one thing, we've discovered that the universe is a much stranger place than we ever imagined. We've also learned that even the most brilliant minds can be blinded by their own ambition, and that sometimes, the pursuit of knowledge can lead us down some very dark paths.

In the end, the LHC is a testament to the insatiable curiosity of the human race, and our unquenchable thirst for knowledge. It's a monument to the power of science, and the potential consequences of playing God. And as we continue to push the boundaries of what's possible, we must always remember that with great power comes great responsibility. Or, as I like to say, "When the going gets weird, the weird turn pro."

I am using the exact same model. Ryzen 5600G w/32GB and an Nvidia P40 w/24GB VRAM

20/33 layers offloaded to GPU, 4K context. Uses 25GB system RAM and all 24GB VRAM. 5-7 tokens per second.

Context is set to 32768, I didn't change it I guess that's the model's default.

Thanks for making me feel better about investing in that motherboard + CPU + RAM upgrade and deferring the GPU upgrade.

and Groq does 485.08 T/s on mixtral 8x7B-32k

I am not sure local models have any future other than POC/research. Depends on the cost of course.

(Groqster here) For anyone who wants to try it, you can go to https://chat.groq.com/ and choose Mixtral from the drop-down menu. Also, feel free to ask me any questions about Groq hardware or service.