by wolttam
6 hours ago
I suspect DwarfStar could probably squeeze more performance out of the single spark, maybe closer to 20 tok/s. Moving to 2 sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction (MTP). The parallelism and MTP, on top of better-tuned kernels[1], gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60 tok/s at ~150k context; sometimes the MTP seems to really kick in (i.e. a high acceptance rate on its tokens).

I'm currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.

[1]: https://github.com/lukealonso/b12x
[2]: https://forums.developer.nvidia.com/t/372268
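For a rough feel of why the acceptance rate matters so much for those bursts, here is a back-of-the-envelope sketch in Python. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection, which is a simplification and not how vLLM measures or reports anything. The function name and the draft lengths are made up for illustration, and the cost of producing the draft tokens is ignored.

```python
def expected_tokens_per_step(accept_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Simplified model (an assumption, not vLLM's accounting): each drafted
    token is accepted independently with probability `accept_rate`, and the
    first rejection discards the rest of the draft. The verification pass
    always yields one token itself, so the result is
    1 + a + a^2 + ... + a^k for k drafted tokens.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    return sum(accept_rate ** i for i in range(draft_tokens + 1))


if __name__ == "__main__":
    # Hypothetical draft lengths and acceptance rates, for comparison only.
    for k in (1, 2, 3):
        for a in (0.5, 0.7, 0.9):
            gain = expected_tokens_per_step(a, k)
            print(f"draft={k} accept={a:.1f} -> {gain:.2f} tokens per pass")
```

Under this model the multiplier over plain decoding is bounded by the draft length plus one, so large bursts need both a high acceptance rate and the other gains (parallelism, kernels) stacking on top.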
|
I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and inference is fast. Do you run DeepSeek V4 Flash on vLLM?