
The way I read OP, it's ultimately highlighting the expense of verifying speculated tokens that may be wrong (and the first wrong token invalidates every token after it). That cost applies just as much to things like MTP (multi-token prediction), even though it's a core feature of the model. You can decide you just don't care about matching the accuracy of the original model and skip the verification step altogether, but then you're moving closer to something like a text-diffusion model, with very different tradeoffs involved.
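
To make the "first wrong token invalidates the rest" point concrete, here's a rough sketch of greedy verification. Everything in it is made up for illustration: `verify_draft`, `target_next` and `toy_target` are hypothetical names, the "target model" is a toy function, and the check runs one token at a time for readability, whereas real implementations typically score the whole draft in a single batched forward pass. It also leaves out the extra token you normally get for free when the whole draft is accepted.

```python
from typing import Callable, List, Sequence


def verify_draft(
    target_next: Callable[[Sequence[int]], int],
    prefix: Sequence[int],
    draft: Sequence[int],
) -> List[int]:
    """Greedy verification of a speculated continuation (illustrative only).

    Keeps draft tokens until the first one that differs from what the
    target would have produced, replaces that one with the target's token,
    and discards everything after it.
    """
    accepted: List[int] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)  # what the target model would emit here
        if tok != expected:
            # First mismatch: take the target's token and drop the rest of
            # the draft, since later tokens were conditioned on a wrong one.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    return accepted


def toy_target(context: Sequence[int]) -> int:
    # Hypothetical stand-in for the target model: always predicts last + 1 (mod 10).
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # The draft gets the first two tokens right, then guesses 9 instead of 6.
    print(verify_draft(toy_target, [1, 2, 3], [4, 5, 9, 7]))  # [4, 5, 6]
```

Note that the trailing `7` in the draft is thrown away even though the target still had to be consulted at each accepted position, which is the verification cost in question.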