by alfiedotwtf 46 days ago
What is everyone running DeepSeek v4 Flash with?!

It’s currently unsupported in llama.cpp, and vLLM doesn’t support a GPU+CPU MoE split, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!

2 comments

you can run it today with mlx if you have a 256 GB or 512 GB mac studio. no "antirez" fork needed.

it isn't that large of a model and the compressed kv implementation is not that complicated

the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upcast it to higher precision and then quantize it back down.

vllm runs dsv4 flash fine right now

dgx sparks cannot really run it correctly right now with released vllm but there are PRs, it's just a matter of time. you would need 3 of them. they will still be almost 1/2 as fast as the mac studio.

so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.

Unfortunately I didn't get a Mac with lots of RAM back when it was cheap, and I'd personally rather focus on moving away from Apple and going Linux full-time at work and home (currently a MacBook as my laptop, connected to my big rig; well, it's not that big compared to the AI people in here).
What kind of RAM does your MacBook have? It might still be worth experimenting w/ DS4 using disk offload, though it would be dog slow at best and the RAM would be much too limited for meaningful parallelism, especially for larger contexts.
This might be my only hope until RAM prices come down to human levels again
If you have a 256 GB or 512 GB Mac Studio, the real game is to run multiple sessions in parallel in order to make the best use of your limited memory bandwidth. You'd have plenty of excess RAM for that given how small the KV cache is even at max context.
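A rough way to see why parallel sessions help on bandwidth-bound hardware: each decode step reads the shared weights once for the whole batch, plus whichever experts the batch happens to route to, plus each session's KV cache. The sketch below is a back-of-the-envelope model, not a benchmark. It assumes uniform random expert routing, and every parameter name and number in it (bandwidth, sizes, expert counts) is a hypothetical placeholder, not a DeepSeek v4 Flash spec.

```python
def expected_unique_experts(num_experts, top_k, batch):
    """Expected number of distinct experts touched per layer when each of
    `batch` tokens picks `top_k` of `num_experts` uniformly at random
    (an assumption; real routers are not uniform)."""
    return num_experts * (1.0 - (1.0 - top_k / num_experts) ** batch)


def aggregate_tokens_per_sec(batch, bandwidth_gbs, shared_gb, expert_gb_total,
                             num_experts, top_k, kv_gb_per_session):
    """Bandwidth-bound estimate of total tokens/s across `batch` sessions.
    All sizes are in GB; all inputs are illustrative placeholders."""
    unique = expected_unique_experts(num_experts, top_k, batch)
    # Expert weights are only read for the experts the batch actually hits.
    expert_gb_read = expert_gb_total * unique / num_experts
    gb_per_step = shared_gb + expert_gb_read + batch * kv_gb_per_session
    steps_per_sec = bandwidth_gbs / gb_per_step
    # One decode step produces one token per session.
    return batch * steps_per_sec


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the curve.
    for b in (1, 2, 4, 8):
        rate = aggregate_tokens_per_sec(
            batch=b, bandwidth_gbs=800.0, shared_gb=4.0, expert_gb_total=76.0,
            num_experts=64, top_k=4, kv_gb_per_session=0.5)
        print(f"batch={b}: ~{rate:.1f} tok/s aggregate")
```

In this model the aggregate rate goes up with batch size because the shared weights are amortized across sessions, while per-session speed drops. How much you actually gain depends on the real routing overlap and KV cache size.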
https://www.github.com/antirez/ds4 (from Antirez of Redis fame) runs a 2-bit quant on Apple Silicon hardware with 96GB or 128GB of RAM.
I've been keeping an eye on Antirez's Metal fork for llama.cpp, but I totally missed this. Whoa, nice. Giving it a go, thanks!!
What kind of hardware are you planning to run this on? As mentioned already, I've been trying to understand how gracefully it might degrade on 64GB of RAM or perhaps less (the total weight size is 80GB at the provided quant) using SSD offload for the weights. Then, assuming it works and doesn't just OOM, the question is whether the tok/s figures might meaningfully improve in that scenario by running multiple sessions in parallel.
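For a feel of what SSD offload might cost, here is a minimal sketch of the arithmetic. It assumes weight accesses are spread evenly over the file, so the share of each token's reads that misses RAM equals the share of weights that doesn't fit. That is a simplification (hot experts would cache better), and `active_gb_per_token` and `ssd_gbs` are hypothetical inputs you'd have to measure yourself. Only the 80GB weight size comes from the thread.

```python
def offload_estimate(weights_gb, usable_ram_gb, active_gb_per_token, ssd_gbs):
    """Return (resident_fraction, miss_gb_per_token, tok_s_upper_bound).

    Assumes uniform access over the weights, so the fraction of each
    token's reads that must come from SSD is 1 - resident_fraction.
    """
    resident = min(1.0, usable_ram_gb / weights_gb)
    miss_gb = active_gb_per_token * (1.0 - resident)
    # If everything fits in RAM, the SSD is not the bottleneck at all.
    tok_s = float("inf") if miss_gb == 0 else ssd_gbs / miss_gb
    return resident, miss_gb, tok_s


if __name__ == "__main__":
    # 80 GB of weights (from the thread). The rest are placeholder guesses:
    # RAM left over after the OS and KV cache, bytes read per token, SSD speed.
    resident, miss_gb, tok_s = offload_estimate(
        weights_gb=80.0, usable_ram_gb=56.0,
        active_gb_per_token=4.0, ssd_gbs=5.0)
    print(f"resident: {resident:.0%}, SSD reads/token: {miss_gb:.2f} GB, "
          f"SSD-bound ceiling: {tok_s:.1f} tok/s")
```

This is only an upper bound from SSD throughput. Page-cache thrash, random-read penalties, and compute would all push real numbers lower.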
I've got a 4060 Ti 12GB with 128GB RAM. I was hoping once I could demonstrate to myself that I could run DeepSeek v4 Flash locally (even at really slow speeds), then it would be worth my time and money to get something to run it at > 20 t/s.

... currently testing out Stepfun 3.5 Flash Q4_K_M as a stopgap (unless it blows my socks off first).

I don't think the DS4 project supports the CPU/GPU split approach you'd need for best performance on that kind of hardware (shared layers on GPU, most experts on CPU). CPU-only inference would work but might be slow.
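To make the split concrete, here is a minimal sketch of the placement idea: routed-expert tensors go to CPU RAM, everything else goes to the GPU until a VRAM budget runs out. The tensor names, the `.experts.` naming pattern, and the sizes are all made up for illustration. This is not how the DS4 project or any particular runtime names or loads its weights.

```python
import re

# Hypothetical naming convention for routed-expert tensors.
EXPERT_PATTERN = re.compile(r"\.experts\.")


def place_tensors(tensor_sizes_gb, vram_budget_gb):
    """Map each tensor name to "gpu" or "cpu".

    Routed experts always go to CPU. Other tensors (attention, shared
    layers) go to GPU in iteration order while they still fit the budget.
    """
    placement = {}
    used_gb = 0.0
    for name, size_gb in tensor_sizes_gb.items():
        if EXPERT_PATTERN.search(name) or used_gb + size_gb > vram_budget_gb:
            placement[name] = "cpu"
        else:
            placement[name] = "gpu"
            used_gb += size_gb
    return placement


if __name__ == "__main__":
    # Made-up tensors and sizes, just to exercise the function.
    sizes = {
        "layers.0.attn.weight": 1.0,
        "layers.0.shared_expert.weight": 0.5,
        "layers.0.experts.0.w1": 2.0,
        "layers.0.experts.1.w1": 2.0,
    }
    for name, device in place_tensors(sizes, vram_budget_gb=10.0).items():
        print(f"{device:>3}  {name}")
```

The point of the split is that the dense, always-used tensors get the fast memory, while the sparse experts (only a few of which fire per token) sit in the larger, slower pool.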
Ah dang. Hmm, damn this hobby is expensive. Maybe I should just take up drugs instead