by girvo 2 days ago
> DGX Spark-alike is really just asking for trouble. Prefill kills perf.

You're right that prefill kills perf, but, shrug, the GB10 has far more compute than it has memory bandwidth, so prefill (which is compute-bound) isn't its bottleneck.
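To make the compute-vs-bandwidth point concrete, here's a rough roofline-style sketch. The hardware numbers are placeholders I made up for illustration, not GB10 specs, and the 4-bit weights (0.5 bytes per parameter) are an assumption. The 11B active parameters come from the A11B model mentioned elsewhere in the thread. It ignores KV cache traffic and attention FLOPs, and for a MoE a large batch touches more experts than a single token does, so treat it as a toy model only.

```python
def pass_time_s(tokens, active_params=11e9, bytes_per_param=0.5,
                peak_flops=100e12, mem_bw=250e9):
    """Return (compute_seconds, memory_seconds) for one forward pass.

    peak_flops and mem_bw are hypothetical placeholders, not measured specs.
    """
    # Roughly 2 FLOPs (multiply + add) per active weight per token.
    compute_s = 2.0 * active_params * tokens / peak_flops
    # Weights are streamed from memory once per pass, however many tokens ride along.
    memory_s = active_params * bytes_per_param / mem_bw
    return compute_s, memory_s


def limiting_resource(tokens, **hw):
    """Name whichever resource takes longer for a pass over `tokens` tokens."""
    compute_s, memory_s = pass_time_s(tokens, **hw)
    return "compute" if compute_s > memory_s else "memory"


if __name__ == "__main__":
    for n in (1, 8, 64, 2048):
        print(n, limiting_resource(n))
```

Single-token decode sits on the memory side and a big prefill batch sits on the compute side, which is why a box that's heavy on compute and light on bandwidth hurts less at prefill than at decode.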

I've seen the same: Sparks are great at non-time-sensitive tasks. If you can set up an agentic loop that doesn't require human intervention, you can design around the memory bandwidth limitations.
The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.
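A back-of-the-envelope sketch of that trade, using the standard expected-acceptance formula for speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, which is a simplification. The cost parameters are hypothetical knobs measured in units of one normal decode step, not numbers from any real implementation.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k drafted tokens.

    Assumes each draft token is accepted independently with probability alpha.
    The target model always contributes one token, hence k + 1 when alpha == 1.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int,
                      draft_cost: float = 0.05,
                      verify_cost: float = 1.0) -> float:
    """Rough decode speedup over plain one-token-per-pass decoding.

    draft_cost and verify_cost are hypothetical, in units of one normal
    decode step. verify_cost near 1.0 models a bandwidth-bound box where
    checking k + 1 tokens costs about the same as decoding one.
    """
    return expected_tokens_per_pass(alpha, k) / (verify_cost + k * draft_cost)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=3), 2))
```

The more spare compute you have relative to bandwidth, the closer `verify_cost` stays to 1.0, and the more of the accepted tokens you get essentially for free.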

I’m getting 40 tk/s decode with 1000+ tk/s prefill with a 198B-A11B model on mine.

I thought MTP wasn't very useful on MoE models because the expert overlap for 2 tokens was too small.
Still helps, and Step 3.5/3.7 were specifically trained for MTP (in a weird triple layer/triple head fashion with a kind of unique architecture)
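For the expert-overlap worry, a quick sketch of how much extra weight traffic verifying several tokens costs on a MoE. It assumes uniform, independent routing, which is pessimistic since adjacent tokens tend to route similarly. The expert counts and the dense fraction below are hypothetical and are not Step 3.5's actual config.

```python
def expected_distinct_experts(num_experts: int, top_k: int, tokens: int) -> float:
    """Expected number of distinct routed experts hit by `tokens` tokens.

    Assumes each token picks top_k of num_experts uniformly and independently.
    """
    p_miss_one_token = 1.0 - top_k / num_experts
    return num_experts * (1.0 - p_miss_one_token ** tokens)


def relative_weight_traffic(num_experts: int, top_k: int, tokens: int,
                            dense_fraction: float) -> float:
    """Weight bytes read for `tokens` tokens, relative to a single token.

    dense_fraction is the (hypothetical) share of per-token active weights
    that every token reads anyway: attention, shared experts, and so on.
    """
    expert_ratio = expected_distinct_experts(num_experts, top_k, tokens) / top_k
    return dense_fraction + (1.0 - dense_fraction) * expert_ratio


if __name__ == "__main__":
    # Hypothetical config: 64 routed experts, top-8 routing, 30% dense weights.
    for t in (1, 2, 3, 4):
        traffic = relative_weight_traffic(64, 8, t, dense_fraction=0.3)
        print(t, round(traffic, 3), "tokens per unit traffic:", round(t / traffic, 3))
```

Even with barely any expert overlap, the dense part of the model is read only once per pass, so tokens per byte still improves as long as enough drafts are accepted.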

With the currently-in-PR implementation it doubles decode performance on the tasks I've been testing it against, and even in the worst case it's still a 35% uplift. So on a box with heaps of compute and not much memory bandwidth, it's worth it in practice.