by trouve_search
5 hours ago
On a 5090, gemma4 26B runs at 350 TPS with the command below [1], and gemma4 31B runs at around 150 TPS with a similar command. I'm really surprised how much slower a DGX Spark is for the same price.

[1] Here's my command:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
--dtype auto \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-chunked-prefill \
--enable-prefix-caching \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4 \
--max-num-batched-tokens 16000 \
--max-model-len 64000 \
--max-num-seqs 12 \
--speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'
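
For anyone wanting to compare TPS figures like the ones above on their own hardware, here is a rough sketch of one way to measure it. It is not how the numbers in this comment were obtained. It assumes the server is listening on vLLM's default port 8000, that `curl` and `jq` are installed, and that `date` supports `%N` (GNU date). The helper names `tps` and `measure` are made up for illustration. It times a single request end to end, so prefill and HTTP overhead are included and batched throughput will differ.

```shell
#!/usr/bin/env bash
# tps TOKENS SECONDS -> prints tokens per second with one decimal place.
tps() {
  awk -v t="$1" -v s="$2" \
    'BEGIN { if (s <= 0) { print "nan"; exit 1 } printf "%.1f\n", t / s }'
}

# measure: send one completion request and print generated tokens per second.
# Assumption: server at localhost:8000, model name equal to the served path.
measure() {
  local start end resp tokens elapsed
  start=$(date +%s.%N)
  resp=$(curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit", "prompt": "Write a long story.", "max_tokens": 512}')
  end=$(date +%s.%N)
  # The OpenAI-compatible response reports generated tokens under usage.
  tokens=$(printf '%s' "$resp" | jq '.usage.completion_tokens')
  elapsed=$(awk -v a="$start" -v b="$end" 'BEGIN { print b - a }')
  tps "$tokens" "$elapsed"
}
```

Source the file and run `measure` while the server is up. Running it several times and discarding the first (warm-up) result should give a steadier number.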
|
You can run multiple instances of these models in parallel on the DGX Spark, which somewhat mitigates the difference if your task is parallelizable.
But I'd personally take the simplicity of a single thread and higher throughput.
Overall, of course, it's still better to wait for next-gen devices if you can.
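
To make the parallel-instances idea concrete, here is a minimal sketch that only prints launch commands rather than running them. The function name `launch_cmds`, the 0.90 total memory budget, and the equal split per instance are assumptions for illustration. Whether several copies actually fit in memory, and how much aggregate throughput they give, has to be checked on the device. A client or load balancer would also need to spread requests across the ports.

```shell
#!/usr/bin/env bash
# launch_cmds N [BASE_PORT] [BUDGET]
# Print one `vllm serve` command per instance. Each instance gets its own
# port and an equal share of the total GPU memory budget (a fraction 0..1).
launch_cmds() {
  local n="$1" base_port="${2:-8000}" budget="${3:-0.90}"
  local share i
  share=$(awk -v b="$budget" -v n="$n" 'BEGIN { printf "%.2f", b / n }')
  for ((i = 0; i < n; i++)); do
    echo "vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit --port $((base_port + i)) --gpu-memory-utilization $share"
  done
}

# Example: three instances on ports 8000 to 8002, each capped at 0.30.
launch_cmds 3
```

Printing the commands first makes it easy to review the ports and memory shares before starting anything. Each line can then be run in its own terminal or appended with `&`.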