| HN Mirror

	by alfiedotwtf 46 days ago
	What is everyone running DeepSeek v4 Flash with?! It’s currently unsupported on llama.cpp and vLLM doesn’t support GPU+CPU MoE, so unless all of you have an array of DGX Sparks in your bedroom, what’s the secret sauce?!

2 comments

doctorpangloss 45 days ago

you can run it today with mlx if you have a 256gb or 512gb mac studio. no "antirez" fork needed.

it isn't that large of a model and the compressed kv implementation is not that complicated

the problem is that they released the model in a quantized format that is more complex than it appears, and people make a lot of mistakes working with it. it is quantization-aware-trained, so you can't "just" upscale it and scale down.

vllm runs dsv4 flash fine right now

dgx sparks cannot really run it correctly right now with released vllm but there are PRs, it's just a matter of time. you would need 3 of them. they will still be almost 1/2 as fast as the mac studio.

so the punchline is, well, this is why the 512g mac studio is such a hot commodity right now.

alfiedotwtf 45 days ago

Unfortunately I didn't get a Mac with big RAM at the time it was cheap, and I'd personally focus on moving away from Apple and going Linux full-time at work and home (currently a MacBook for a laptop connected to my big rig, well it's not that big compared to the AI people in here).

zozbot234 45 days ago

What kind of RAM does your MacBook have? It might still be worth experimenting w/ DS4 using disk offload, though it would be dog slow at best and the RAM would be much too limited for meaningful parallelism, especially for larger contexts.

alfiedotwtf 45 days ago

This might be my only hope until RAM prices come down to human levels again

zozbot234 45 days ago

If you have a 256 GB or 512 GB Mac Studio, the real game is to run multiple sessions in parallel in order to make the best use of your limited memory bandwidth. You'd have plenty of excess RAM for that given how small the KV cache is even at max context.
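A rough way to see why parallel sessions help is a back-of-envelope model. This is a minimal sketch, assuming decode is purely memory-bandwidth bound and that every session in a step shares one read of the same weights. The function and parameter names are hypothetical and the numbers in the usage loop are illustrative, not measurements of DeepSeek V4 Flash or of any Mac Studio. In an MoE model different sessions route to different experts, so real scaling will likely be worse than this idealized estimate.

```python
def aggregate_tokens_per_second(bandwidth_gb_s, active_weights_gb,
                                kv_gb_per_session, sessions):
    """Idealized bandwidth-bound decode throughput summed over all sessions."""
    if sessions < 1:
        raise ValueError("sessions must be >= 1")
    # One decode step reads the active weights once (shared by all sessions,
    # an assumption) plus each session's own KV cache.
    gb_per_step = active_weights_gb + sessions * kv_gb_per_session
    steps_per_second = bandwidth_gb_s / gb_per_step
    # Each step produces one token per session.
    return steps_per_second * sessions


# Illustrative inputs only: 800 GB/s bandwidth, 10 GB of active weights,
# 0.5 GB of KV cache read per session per step.
for n in (1, 2, 4, 8):
    print(n, round(aggregate_tokens_per_second(800.0, 10.0, 0.5, n), 1))
```

The smaller the per-session KV cache is relative to the weights, the closer the aggregate figure gets to scaling linearly with the number of sessions.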

zozbot234 46 days ago

https://www.github.com/antirez/ds4 (from Antirez of Redis fame) runs a 2-bit quant on Apple Silicon hardware with 96GB or 128GB RAM.

alfiedotwtf 45 days ago

I've been keeping an eye on Antirez's Metal fork for llama.cpp, but I totally missed this. Whoa, nice. Giving it a go, thanks!!

zozbot234 45 days ago

What kind of hardware are you planning to run this on? As mentioned already, I've been trying to understand how gracefully it might degrade on 64GB RAM or perhaps lower (the total weights size is 80GB at the provided quant) using SSD offload for the weights, and then (assuming it works and doesn't just OOM) whether the tok/s figures might meaningfully improve in that scenario by running multiple sessions in parallel.
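To put rough numbers on the 64GB case, here is a minimal sketch of how much of the 80GB of weights would have to live on SSD. The 80GB and 64GB figures come from the comment above. `reserved_gb` (memory kept back for the OS, KV cache and activations) is a made-up assumption, and the function name is hypothetical.

```python
def offload_estimate(weights_gb, ram_gb, reserved_gb):
    """Return (resident_gb, spilled_gb, resident_fraction) for a simple split."""
    # RAM left for weights after the assumed reservation, never negative.
    usable_gb = max(ram_gb - reserved_gb, 0.0)
    resident_gb = min(weights_gb, usable_gb)
    # Whatever does not fit must be paged in from SSD on demand.
    spilled_gb = weights_gb - resident_gb
    return resident_gb, spilled_gb, resident_gb / weights_gb


# 80 GB of weights on a 64 GB machine, assuming 8 GB is reserved.
resident, spilled, fraction = offload_estimate(80.0, 64.0, 8.0)
print(resident, spilled, fraction)
```

This says nothing about speed. How often the spilled part is actually touched depends on which experts get routed to, which the sketch does not model.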

alfiedotwtf 45 days ago

I've got a 4060 Ti 12GB with 128GB RAM. I was hoping once I could demonstrate to myself that I could run DeepSeek v4 Flash locally (even at really slow speeds), then it would be worth my time and money to get something to run it at > 20 t/s.

... currently testing out Stepfun 3.5 Flash Q4_K_M as a stopgap (unless it blows my socks off first).

zozbot234 45 days ago

I don't think the DS4 project supports the CPU/GPU split approach you'd need for best performance on that kind of hardware (shared layers on GPU, most experts on CPU). CPU-only inference would work but might be slow.
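For illustration, the split described above (shared layers on GPU, most experts on CPU) can be sketched as a placement plan. This is not how DS4 or any particular runtime does it. The tensor names and the `.experts.` naming convention are hypothetical, chosen only to show the idea.

```python
def assign_device(tensor_name):
    """Routed-expert weights go to CPU RAM, everything else goes to the GPU."""
    # Hypothetical convention: routed experts have ".experts." in their name.
    return "cpu" if ".experts." in tensor_name else "gpu"


def plan_placement(tensor_sizes_gb):
    """Group tensor names by device and total the size placed on each."""
    plan = {"gpu": [], "cpu": []}
    totals = {"gpu": 0.0, "cpu": 0.0}
    for name, size_gb in tensor_sizes_gb.items():
        device = assign_device(name)
        plan[device].append(name)
        totals[device] += size_gb
    return plan, totals


# Made-up tensor names and sizes, purely to exercise the function.
example = {
    "layers.0.attn.wq": 0.10,
    "layers.0.shared_expert.w1": 0.05,
    "layers.0.experts.7.w1": 0.05,
}
plan, totals = plan_placement(example)
print(plan, totals)
```

The point of such a split is that attention and shared weights are used for every token, while each routed expert is used only for some tokens, so the large but sparsely used part sits in the larger, slower memory.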

alfiedotwtf 45 days ago

Ah dang. Hmm, damn this hobby is expensive. Maybe I should just take up drugs instead
