| HN Mirror



	by htrp 2 days ago
	I've seen the same. Sparks are great at non-time-sensitive tasks. If you can set up an agentic loop that does not require human intervention, you can design around the memory bandwidth limitations.


girvo 2 days ago

The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.

I’m getting 40 tk/s decode with 1000+ tk/s prefill with a 198B-A11B model on mine.
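
A rough back-of-envelope sketch of the compute-for-bandwidth trade described above. It assumes decode is purely bandwidth-bound (each forward pass streams the active weights once) and that drafted tokens are accepted independently with a fixed probability. The function names, the 273 GB/s bandwidth figure, the 4-bit weight size, and the acceptance rate are all illustrative assumptions, not measurements from this thread, and the model ignores draft-head cost and KV-cache traffic.

```python
def decode_tokens_per_s(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Upper bound on bandwidth-bound decode speed.

    Assumes every forward pass reads all active weights exactly once.
    """
    gb_per_pass = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_pass


def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate; the pass always yields at least one token.
    """
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


if __name__ == "__main__":
    # Hypothetical inputs: 273 GB/s, 11B active params, 0.5 bytes/param (4-bit).
    base = decode_tokens_per_s(273, 11, 0.5)
    # Hypothetical acceptance rate and draft length.
    gain = expected_tokens_per_pass(0.7, 3)
    print(f"baseline bound: {base:.1f} tk/s, speculative multiplier: {gain:.2f}x")
```

The multiplier is an optimistic ceiling: on an MoE model, verifying several tokens in one pass can touch more experts than a single token does, so the bytes read per pass also grow.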


EnPissant 2 days ago

I thought MTP wasn't very useful on MoE models because the expert overlap for 2 tokens was too small.
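
The overlap concern can be put in numbers with a minimal sketch, assuming the router picks `top_k` experts uniformly at random and independently per token (real routers are likely more correlated across adjacent tokens, so this is a pessimistic case). The expert counts below are hypothetical and do not describe any specific model.

```python
def expected_distinct_experts(num_experts, top_k, num_tokens):
    """Expected number of distinct experts touched in one MoE layer.

    Assumes each token independently selects top_k experts uniformly
    at random, so a given expert is missed by one token with
    probability (1 - top_k / num_experts).
    """
    p_missed_by_all = (1 - top_k / num_experts) ** num_tokens
    return num_experts * (1 - p_missed_by_all)


if __name__ == "__main__":
    # Hypothetical layer: 64 experts, 8 active per token.
    one = expected_distinct_experts(64, 8, 1)
    two = expected_distinct_experts(64, 8, 2)
    print(f"1 token: {one:.1f} experts, 2 tokens: {two:.1f} experts")
```

Under these assumptions two tokens read nearly twice the expert weights of one, so the saving on expert traffic is small; any gain then has to come mostly from the weights every token shares, such as attention layers.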


girvo 2 days ago

Still helps, and Step 3.5/3.7 were specifically trained for MTP (in a weird triple layer/triple head fashion with a kind of unique architecture)

With the currently-in-PR implementation it doubles decode performance for all the tasks I've been testing it against, and in the worst case it is still a 35% uplift. So on a box with heaps of compute and not much memory bandwidth, it's worth it in practice.
