| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by trouve_search 47 days ago

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

2 comments

diddid 46 days ago

With the 5090 you need to buy the rest of the computer though, and the Dgx spark will run 1/4th as slow but use 1/5th the electricity. And the spark would be able to run things the 5090 just couldn’t, like the Qwen3.5 122b. Which is all just to say that for llm workflows there is no easy answer. And if you media generation it gets even more complicated.

link

adam_arthur 47 days ago

Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.

But I'd take the simplicity of a single thread and higher throughput personally.

Overall of course still better to wait for next gen devices if you can.

link