It’s currently unsupported on Llama.cpp and vllm doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!
you can run it today with mlx if you have 256g or 512g mac studio. no "antirez" fork needed.
it isn't that large of a model and the compressed kv implementation is not that complicated
the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upscale it and scale down.
vllm runs dsv4 flash fine right now
dgx sparks cannot really run it correctly right now with released vllm but there are PRs, it's just a matter of time. you would need 3 of them. they will still be almost 1/2 as fast as the mac studio.
so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.
Unfortunately I didn't get a Mac with lots of RAM back when it was cheap, and I'd personally rather focus on moving away from Apple and going Linux full-time at work and at home (currently I use a MacBook as a laptop connected to my big rig, though it's not that big compared to what the AI people in here have).
What kind of RAM does your MacBook have? It might still be worth experimenting w/ DS4 using disk offload, though it would be dog slow at best and the RAM would be much too limited for meaningful parallelism, especially for larger contexts.
If you have a 256 GB or 512 GB Mac Studio, the real game is to run multiple sessions in parallel in order to make the best use of your limited memory bandwidth. You'd have plenty of excess RAM for that given how small the KV cache is even at max context.
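To put a rough number on the parallel-sessions idea, here is a minimal sketch of how expert weight reads might amortize across sessions in one decode step. Everything in it is an assumption for illustration: the expert count and top-k are hypothetical placeholders (not DS4's real config), and routing is modeled as uniform and independent per session, which real routers are not.

```python
def expected_distinct_experts(num_experts: int, top_k: int, sessions: int) -> float:
    """Expected number of distinct experts touched per MoE layer in one decode step.

    Assumes each session independently picks top_k distinct experts
    uniformly at random (a simplification of real routing).
    """
    # Probability that a given expert is skipped by every session.
    miss = (1.0 - top_k / num_experts) ** sessions
    return num_experts * (1.0 - miss)


def expert_reads_per_token(num_experts: int, top_k: int, sessions: int) -> float:
    """Expert weight reads per generated token when sessions share one step."""
    return expected_distinct_experts(num_experts, top_k, sessions) / sessions


if __name__ == "__main__":
    # Hypothetical config: 64 experts, 4 active per token.
    for n in (1, 2, 4, 8, 16):
        print(n, round(expert_reads_per_token(64, 4, n), 2))
```

Under these assumptions the per-token read cost drops as sessions are added, because some experts get shared within a step. Shared (non-expert) layers are read once per step regardless of session count, so they amortize better than this expert-only estimate suggests.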
What kind of hardware are you planning to run this on? As mentioned already, I've been trying to understand how gracefully it might degrade on 64GB RAM or perhaps lower (the total weights size is 80GB at the provided quant) using SSD offload for the weights, and then (assuming it works and doesn't just OOM) whether the tok/s figures might meaningfully improve in that scenario by running multiple sessions in parallel.
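For a sense of scale on the 64GB case, here's a back-of-the-envelope sketch of how the 80GB of weights would split between RAM and SSD. The `reserved_gb` figure (RAM held back for the OS, KV cache, and activations) is a guess, not a measured value, and the function name is made up for this example.

```python
def offload_split(weights_gb: float, ram_gb: float, reserved_gb: float = 8.0):
    """Return (resident_gb, streamed_gb, resident_fraction).

    resident_gb: weights that can stay in RAM
    streamed_gb: weights that must be read from SSD on demand
    reserved_gb: RAM assumed unavailable for weights (hypothetical default)
    """
    usable = max(ram_gb - reserved_gb, 0.0)
    resident = min(weights_gb, usable)
    streamed = weights_gb - resident
    return resident, streamed, resident / weights_gb


if __name__ == "__main__":
    resident, streamed, frac = offload_split(weights_gb=80.0, ram_gb=64.0)
    print(f"resident={resident} GB, streamed={streamed} GB, fraction={frac:.2f}")
```

With those guesses, roughly 24GB of weights would live on SSD. How often that part actually gets touched depends on which experts the router picks, so this only bounds the footprint and says nothing about tok/s.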
I've got a 4060 Ti 12GB with 128GB RAM. I was hoping that once I could demonstrate to myself that I could run Deepseek v4 Flash locally (even at really slow speeds), it would be worth my time and money to get something that runs it at > 20 t/s.
... currently testing out Stepfun 3.5 Flash Q4_K_M as a stopgap (unless it blows my socks off first).
I don't think the DS4 project supports the CPU/GPU split approach you'd need for best performance on that kind of hardware (shared layers on GPU, most experts on CPU). CPU-only inference would work but might be slow.