by binyu
7 hours ago
Personally, I've tried to squeeze more tok/s out of DeepSeek V4 Flash on a single DGX Spark deployment, but only got marginal improvements. There's still work to do on fusing kernels and other optimizations, but those are already on antirez's roadmap, so it's not worth duplicating effort. I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and inference is fast. Do you run DeepSeek V4 Flash on vLLM?