by ryan_glass
380 days ago
I run DeepSeek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes a GPU, which in my opinion is not the best way to serve large models like this locally. I run a mid-range EPYC 9004 series home server on a Supermicro mobo, which cost around $4000 all-in. It's a single-CPU machine with 384GB RAM (you could get 768GB using 64GB sticks, but this costs more). No GPU means power draw is less than a gaming desktop. Given the RAM limitation I run an Unsloth Dynamic GGUF which, quality-wise, performs very close to the original in real-world use. It is around 270GB, which leaves plenty of room for context. I normally run 16k context as I use the machine for other things too, but I can up it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. There are plenty of people running similar setups with 2 CPUs who run the full version at similar tokens/second.
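For anyone sizing a similar box, the arithmetic behind those numbers can be sketched roughly as below. The RAM, model size and tokens/second figures are the ones quoted above; the 500-token reply length and the function names are assumptions made up for illustration, and the headroom figure ignores how much the OS and other workloads actually take.

```python
# Figures quoted in the comment; everything else here is illustrative.
TOTAL_RAM_GB = 384
MODEL_SIZE_GB = 270


def headroom_gb(total_ram_gb, model_size_gb):
    """RAM left over for the KV cache, the OS and any other workloads."""
    return total_ram_gb - model_size_gb


def generation_seconds(num_tokens, tokens_per_second):
    """Wall-clock time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


print(headroom_gb(TOTAL_RAM_GB, MODEL_SIZE_GB))  # 114

# Hypothetical 500-token reply at the quoted decode rates.
print(generation_seconds(500, 10))           # 50.0
print(round(generation_seconds(500, 7), 1))  # 71.4
```

How much of that headroom a 16k or 24k context actually consumes depends on the runtime and its KV cache settings, so treat the result as an upper bound rather than a measurement.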
How close are we talking?
I’m not calling you a liar, OP, but in general I wish people making such broad claims would be more rigorous.
Unsloth does amazing work; however, as far as I’m aware, even they do not publish head-to-head evals against the original unquantized models.
I have sympathy here because very few people and companies can afford to run the original models, let alone engineer rigorous evals.
However, I felt compelled to comment because my experience does not match. For relatively simple usage the differences are hard to notice, but they become much more apparent in high-complexity and long-context tasks.
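To make "head to head" concrete, here is a minimal sketch of a paired comparison, assuming you already have a pass/fail result per task for both the original and the quantized model on the same task set. The function name and the input format are hypothetical, not anything Unsloth or DeepSeek publishes.

```python
from collections import Counter


def paired_summary(original_pass, quantized_pass):
    """Compare two models on the same tasks, given per-task pass/fail flags.

    Both arguments are equal-length lists of booleans, where index i refers
    to the same task for both models.
    """
    if len(original_pass) != len(quantized_pass):
        raise ValueError("both models must be scored on the same tasks")
    if not original_pass:
        raise ValueError("need at least one task")

    n = len(original_pass)
    counts = Counter(zip(original_pass, quantized_pass))
    return {
        "n": n,
        "original_accuracy": sum(original_pass) / n,
        "quantized_accuracy": sum(quantized_pass) / n,
        # Discordant pairs: tasks where exactly one model succeeded.
        "only_original_passed": counts[(True, False)],
        "only_quantized_passed": counts[(False, True)],
    }


# Placeholder flags purely to show the call shape; not real results.
print(paired_summary([True, True, False, True], [True, False, False, True]))
```

Running this separately on simple tasks and on high-complexity or long-context tasks would show whether the gap really opens up where I'm claiming it does; the discordant counts tend to be more informative than the two headline accuracies.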