Hacker News
by nabla9 298 days ago
It's for comparison using raw, non-optimized models. Both can do much better when you optimize for inference.

The information is in the ratio of these numbers, and that ratio stays the same.

OK, then just to clarify: you can fit 4x larger models on the Spark than on the 5090, not 17x.
@nabla9 has tried to tell you that you can also use optimized models on the DGX Spark, which means the Spark can also run inference on bigger models, such as those exceeding 200B.

Please compare the same things: carrots vs. carrots, not apples vs. eggs.

I don't understand what's not optimized on the 5090. If we were comparing with Apple chips or AMD Strix Halo, then yes, you would have very different hardware and software support, no FP4, etc. But here everything is CUDA, Blackwell vs. Blackwell, with the same FP4 structured sparsity, so I don't see how it would be honest to compare a quantized FP4 model on the Spark with an unoptimized FP16 model on a 5090.
I think what they are saying is that the Spark can run an unoptimized FP16 model with 200B parameters, but I don't really know.
You can't. The Spark has 128GB VRAM; the highest you can go in FP16 is 64B — and that's with no space for context.

200B is probably a rough estimate of Q4 + some space for context.

The Spark has 4x the VRAM of a 5090. That's all you need to know from a "how big can it go" perspective.
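
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes 1 GB = 1e9 bytes, 2 bytes per parameter for FP16 and 0.5 for FP4/Q4, 32 GB on the 5090, and that weights fill all of memory (no KV cache, activations, or runtime overhead). The function and dictionary names are made up for illustration.

```python
# Back-of-the-envelope upper bound on model size when weights fill all memory.
# Assumptions: 1 GB = 1e9 bytes; no room reserved for context or overhead.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

# Memory sizes in GB: 128 is from the thread, 32 is the 5090's memory size.
DEVICES_GB = {"DGX Spark": 128, "RTX 5090": 32}


def max_params_billions(memory_gb: float, precision: str) -> float:
    """Largest parameter count (in billions) whose weights alone fit in memory."""
    return memory_gb / BYTES_PER_PARAM[precision]


for name, memory_gb in DEVICES_GB.items():
    for precision in BYTES_PER_PARAM:
        size = max_params_billions(memory_gb, precision)
        print(f"{name:10s} {precision}: ~{size:.0f}B parameters")
```

Under these assumptions the Spark tops out at about 64B in FP16 and 256B in FP4, and the 5090 at about 16B and 64B. The ratio is 4x at every precision, which is the whole point. A 200B model at 4 bits needs about 100 GB for weights, which leaves some of the 128 GB for context.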

From the NVIDIA DGX Spark datasheet:

  With 128 GB of unified system memory, developers can experiment, fine-tune, or inference models of up to 200B parameters. Plus, NVIDIA ConnectX™ networking can connect two NVIDIA DGX Spark supercomputers to enable inference on models up to 405B parameters.
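
The datasheet excerpt does not say which precision those figures assume, so here is a quick sanity check as a sketch. It assumes 1 GB = 1e9 bytes, counts weights only, and treats two linked Sparks as a combined 256 GB, which is my assumption and not something the quote states. The helper name is hypothetical.

```python
# Weights-only memory footprint for the datasheet's model sizes.
# Assumptions: 1 GB = 1e9 bytes; two linked Sparks pool to 2 x 128 = 256 GB.


def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB."""
    return params_billions * bits_per_param / 8


# (model size in billions of parameters, available memory in GB)
CASES = [(200, 128), (405, 256)]

for params, memory_gb in CASES:
    for bits in (16, 4):
        need = weights_gb(params, bits)
        verdict = "fits" if need <= memory_gb else "does not fit"
        print(f"{params}B at {bits}-bit: {need:.1f} GB of weights, "
              f"{verdict} in {memory_gb} GB")
```

With these numbers, neither 200B nor 405B fits at 16 bits (400 GB and 810 GB), while both fit at 4 bits (100 GB and 202.5 GB). So the datasheet figures look consistent with a 4-bit model, not FP16.
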
You and nabla9 are both the ones comparing apples and eggs. 4x more RAM means 4x larger models when everything else is held the same to make a fair comparison.