Hacker News new | ask | show | jobs
by thriw63748 876 days ago
Why not normal RAM? Ryzen 5600 with 128GB DDR4 is perfectly fine to run mixtral 8bit, and costs less than $1000.

GPUs are only needed if you can not wait 5 minutes for an answer, or for training.

7 comments

Or if you want multiple sessions at the same time. Or if you want to do anything else with your machine while it's running.

But realistically, 5 minutes is too long. It should be conversational, and for that you need at least 5 tokens per second. Which your Ryzen just can't do.

>It should be conversational, and for that you need at least 5 tokens per second.

To be fair, a lot of people are using this for non-interactive work, like batching document analysis or offline processing of user generated content.

This particular thread we are commenting on is about Dolphin Mixtral, which is mostly used for offline code completion (à là Microsoft GitHub Copilot). You don’t want to have to wait 5 minutes at every keystroke to get code suggestions.
In my experience, it takes some experimentation to figure out a good prompt. I don’t think I would have gotten very far off I had to wait that long for each result.
Why not both? Llama.cpp allows layering GGUF models between GPU and CPU memory.
> GPUs are only needed if you can not wait 5 minutes for an answer

Yeah, but that's generally true (or at least, “5 minutes for an answer is very suboptimal”, even if “can’t” isn’t quite true) for interactive use cases, which are... a lot of LLM use cases.

Not sure why you're getting downvoted. It performs decent enough on my Ryzen 3600X with 64GB of RAM. It definitely wouldn't be usable for production or fine-tuning, but it's fine for experimenting.
> perfectly fine

Only for very short context and responses.

Beyond that, the performance is painful.

That was what I was referring to with the 32/64 GB systems.
What's the bandwidth between the Ryzen and that DDR4?