| HN Mirror



	by girvo 2 days ago
	> DGX Spark-alike is really just asking for trouble. Prefill kills perf.

	You're right that prefill kills perf, but *shrug* the GB10 has far more compute than it has memory bandwidth, so prefill isn't its bottleneck.
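
To make the compute-versus-bandwidth point concrete, here is a rough roofline-style sketch. It is a simplification, not a benchmark: it assumes about 2 FLOPs per active parameter per token, that the active weights are read once per forward pass, and it ignores KV cache traffic and attention cost. The function name and all hardware numbers are hypothetical placeholders, not GB10 specifications.

```python
def pass_time_s(tokens: int, active_params: float, bytes_per_param: float,
                peak_flops: float, mem_bw: float) -> float:
    """Estimate the time of one forward pass over `tokens` tokens.

    The pass takes as long as the slower of the two resources:
    arithmetic (compute-bound) or weight streaming (bandwidth-bound).
    """
    flops = 2.0 * active_params * tokens          # assumed ~2 FLOPs/param/token
    bytes_read = active_params * bytes_per_param  # weights streamed once per pass
    compute_time = flops / peak_flops
    memory_time = bytes_read / mem_bw
    return max(compute_time, memory_time)


# Hypothetical machine and model, chosen only to show the two regimes.
PARAMS = 1e9        # active parameters
BYTES = 1.0         # bytes per parameter (e.g. 8-bit weights)
PEAK = 1e12         # FLOP/s
BW = 1e11           # bytes/s

decode_step = pass_time_s(1, PARAMS, BYTES, PEAK, BW)      # one token per pass
prefill_chunk = pass_time_s(100, PARAMS, BYTES, PEAK, BW)  # 100 tokens per pass
```

With these placeholder numbers the single-token pass is limited by memory bandwidth, while the 100-token pass is limited by compute, which is the distinction between decode and prefill being discussed here.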


htrp 2 days ago

I've seen the same: Sparks are great at non-time-sensitive tasks. If you can set up an agentic loop that does not require human intervention, you can design around the memory bandwidth limitations.


girvo 2 days ago

The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.

I’m getting 40 tk/s decode with 1000+ tk/s prefill with a 198B-A11B model on mine.
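
A back-of-the-envelope sketch of why speculative decoding helps a bandwidth-bound box: if verifying several drafted tokens costs about one weight read, the number of tokens accepted per pass is roughly the speedup. The formula below is the standard expected-acceptance estimate, which assumes each drafted token is accepted independently with the same probability. The names `alpha`, `gamma`, `verify_cost` and `draft_cost` are illustrative parameters, not measurements from the setup above.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification pass.

    alpha: assumed per-token acceptance probability (i.i.d. simplification).
    gamma: number of drafted tokens verified in one pass.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def decode_speedup(alpha: float, gamma: int,
                   verify_cost: float = 1.0, draft_cost: float = 0.0) -> float:
    """Speedup over plain decoding, with costs in units of one normal decode step.

    verify_cost: cost of verifying gamma+1 tokens (about 1.0 when bandwidth-bound).
    draft_cost: cost of producing each drafted token (near 0 for cheap MTP heads).
    """
    step_cost = verify_cost + gamma * draft_cost
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Hypothetical example: 80% acceptance, 3 drafted tokens per pass.
tokens = expected_tokens_per_step(0.8, 3)
speedup = decode_speedup(0.8, 3, verify_cost=1.0, draft_cost=0.05)
```

Under these assumed values the model yields just under 3 tokens per verification pass. The extra compute spent verifying rejected drafts is what gets traded for fewer weight reads per generated token.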


EnPissant 2 days ago

I thought MTP wasn't very useful on MoE models because the expert overlap for 2 tokens was too small.
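
The overlap concern can be roughed out numerically. This sketch assumes routing is uniform and independent across tokens, which real routers are not, so treat it only as a baseline. The expert count and top-k below are hypothetical and do not describe any particular model.

```python
def expected_distinct_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts touched by `num_tokens` tokens.

    Assumes each token picks `top_k` of `num_experts` uniformly at random,
    independently of the other tokens. By linearity of expectation, each
    expert is missed by all tokens with probability (1 - top_k/num_experts)^n.
    """
    p_missed_by_all = (1.0 - top_k / num_experts) ** num_tokens
    return num_experts * (1.0 - p_missed_by_all)


# Hypothetical MoE layer: 64 experts, 8 active per token.
one_token = expected_distinct_experts(64, 8, 1)
two_tokens = expected_distinct_experts(64, 8, 2)
expected_overlap = 2 * one_token - two_tokens  # experts shared by both tokens
```

Under uniform routing two tokens share only about one expert on average in this example, so verifying two tokens reads nearly twice the expert weights of one. That is the effect being described; how much it matters in practice depends on how correlated the routing of adjacent tokens is and on how much of the model is shared (attention, dense layers) rather than routed.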


girvo 2 days ago

Still helps, and Step 3.5/3.7 were specifically trained for MTP (in a weird triple layer/triple head fashion with a kind of unique architecture)

With the currently-in-PR implementation it doubles decode performance for all the tasks I've been testing it against, and in the worst case it is still a 35% uplift, so on a box with heaps of compute and not much memory bandwidth, it's worth it in practice.
