| HN Mirror



	by binyu 7 hours ago
	Personally, I've tried to squeeze more tok/s out of a single DGX Spark running DeepSeek V4 Flash, but only got marginal improvements. There's still work to do on fusing kernels and other optimizations, but those are already on antirez's roadmap, so it's not worth duplicating the effort. I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and inference is fast. Do you run DeepSeek V4 Flash on vLLM?

1 comment

Yep, those are the numbers I'm getting with DSv4 Flash on vLLM across 2 Sparks.