
Why not, it's one way to look at it! That said, I have yet to see other work using speculative decoding go beyond ~1,000 tokens/s, because other bottlenecks start to matter at that point and have to be solved to go further.

Our view is that MTP / speculative decoding could help us get a 2x to 6x multiplier on the tokens-per-second speed we currently achieve.
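For intuition on where a multiplier in that range can come from, here is a rough back-of-the-envelope sketch. It is not our measurement: it assumes each drafted token is accepted independently with the same probability `alpha`, that `k` tokens are drafted per step, and it ignores the cost of drafting and of the wider verification pass, so real speedups will be lower. The names `alpha`, `k`, and `expected_tokens_per_step` are made up for this illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumption (illustrative only): each of the k drafted tokens is accepted
    independently with probability alpha, and acceptance stops at the first
    rejection. The target model always contributes one token itself
    (a correction or a bonus token), hence the geometric sum
    1 + alpha + alpha^2 + ... + alpha^k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, k=4), 3))
```

Under these assumptions, getting toward the upper end of the range needs both a high acceptance rate and a long enough draft, which is part of why the multiplier is quoted as a range rather than a single number.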

We are a bit greedy: we want to stack optimizations on top of each other to get the maximum possible speed.

Speculative decoding involves additional compute to verify the predicted tokens during the forward pass (it's like a small batch). That should be totally doable for dense models, but it will be trickier for MoEs, because it could mean activating more experts and thus more active parameters.
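To make the verification step concrete, here is a minimal sketch of greedy draft verification. It is a toy under stated assumptions, not our implementation: `verify_draft` and `toy_target` are hypothetical names, the "model" is just a function from a token list to the next token, and the loop calls it once per position for readability, whereas a real engine would score all drafted positions in a single forward pass (the "small batch" mentioned above).

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a target model: maps a context to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def verify_draft(target_next: NextToken, prefix: List[int], draft: List[int]) -> List[int]:
    """Return the tokens emitted by one verification step (greedy acceptance).

    Drafted tokens are accepted while they match what the target would have
    produced. At the first mismatch the target's own token is emitted instead
    and the rest of the draft is discarded. If every drafted token matches,
    the target contributes one extra "bonus" token.
    """
    emitted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            emitted.append(expected)  # correction replaces the rejected draft token
            return emitted
        emitted.append(tok)
        context.append(tok)
    emitted.append(target_next(context))  # all drafts accepted: bonus token
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Toy target model (assumption for the demo): counts upward modulo 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The third drafted token (9) is wrong, so two drafts plus one correction are emitted.
    print(verify_draft(toy_target, [1, 2, 3], [4, 5, 9, 7]))
```

Each drafted position is an extra row in the verification pass. For a dense model every row uses the same weights, while in an MoE each row may be routed to different experts, which is where the extra active parameters would come from.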