I've seen the same: Sparks are great at non-time-sensitive tasks. If you can set up an agentic loop that doesn't require human intervention, you can design around the memory bandwidth limitations.
The other benefit is that speculative decoding literally trades compute to make up for low bandwidth, so MTP/EAGLE/DFlash are unreasonably effective on the GB10 IMO, as long as your use case fits it.
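To put rough numbers on that trade, here's a back-of-envelope sketch. It uses the standard expected-tokens formula for speculative decoding, which assumes each drafted token is accepted independently with the same probability. The function names are mine, and the 273 GB/s bandwidth and 4-bit weights in the example are assumptions, not measurements. It also ignores KV cache traffic and the fact that verifying several tokens at once on a MoE can pull in more experts, so treat the outputs as ceilings.

```python
def bandwidth_bound_tps(bandwidth_gb_s: float, active_params_b: float,
                        bytes_per_param: float) -> float:
    """Rough decode ceiling: each token streams the active weights once."""
    gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_per_token


def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verify pass with draft_len drafted tokens.

    Assumes i.i.d. acceptance with probability accept_rate per drafted token.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def speculative_speedup(accept_rate: float, draft_len: int,
                        draft_cost: float = 0.0) -> float:
    """Speedup over plain decoding.

    draft_cost is the cost of one draft step relative to one target forward
    pass (close to 0 for a cheap MTP head, higher for a separate draft model).
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    return tokens / (1 + draft_len * draft_cost)


if __name__ == "__main__":
    # Assumed: ~273 GB/s, 11B active params, 4-bit weights (0.5 bytes/param).
    base = bandwidth_bound_tps(273, 11, 0.5)
    print(f"bandwidth-bound ceiling: {base:.1f} tk/s")
    for a in (0.5, 0.7, 0.9):
        s = speculative_speedup(a, draft_len=3, draft_cost=0.05)
        print(f"accept={a:.1f} -> {s:.2f}x")
```

The point is that the verify pass costs about the same memory traffic as a normal decode step, so every accepted draft token is close to free on a bandwidth-starved box, and the extra compute is something the GB10 has spare.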
I’m getting 40 tk/s decode and 1000+ tk/s prefill with a 198B-A11B model on mine.
Still helps, and Step 3.5/3.7 were specifically trained for MTP (in a weird triple layer/triple head fashion with a kind of unique architecture)
With the currently-in-PR implementation it doubles decode performance for all the tasks I've been testing it against, and even in the worst case it's still a 35% uplift, so on a box with heaps of compute and not much memory bandwidth, it's worth it in practice.