by olddustytrail 476 days ago
How does producing tokens in parallel not just result in completely incoherent output?

Assuming the model tracks convergence in some way, it would simply keep iterating until the error falls below some epsilon.

This means that, in the worst case, it needs as many iterations as a classic autoregressive transformer needs decoding steps.

So they are mostly exploiting the fact that the average response is not, in practice, fully sequential; the model discovers the exploitable parallelism on its own.
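To make the convergence argument concrete, here is a minimal sketch of Jacobi-style fixed-point decoding, which is a swapped-in illustration and not necessarily what the system under discussion does. It assumes a hypothetical deterministic greedy `model(prefix) -> token`; for discrete tokens, "error below epsilon" becomes "no token changed between passes". The two toy models (`counting_model`, `position_model`) are invented for illustration: one is fully sequential, the other has no real dependency between tokens.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # hypothetical greedy next-token function


def autoregressive(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: n strictly sequential model calls."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def jacobi_decode(model: Model, prompt: List[int], n: int,
                  init: int = 0) -> Tuple[List[int], int]:
    """Refine all n positions in parallel until no token changes.

    Returns the tokens and the number of parallel passes used."""
    guess = [init] * n
    for it in range(1, n + 1):
        # One parallel pass: every position is recomputed from the previous guess.
        new = [model(prompt + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: identical to the greedy autoregressive output
            return guess, it
        guess = new
    # After pass k the first k positions are exact, so n passes always suffice.
    return guess, n


def counting_model(prefix: List[int]) -> int:
    return prefix[-1] + 1  # each token depends on the previous one


def position_model(prefix: List[int]) -> int:
    return len(prefix) % 5  # tokens ignore earlier content


if __name__ == "__main__":
    print(jacobi_decode(counting_model, [0], 6))     # ([1, 2, 3, 4, 5, 6], 6)
    print(jacobi_decode(position_model, [1, 2], 6))  # ([2, 3, 4, 0, 1, 2], 2)
```

The fully sequential toy needs all 6 passes, matching the worst case above, while the toy without real dependencies converges after 2 passes (one to produce the tokens, one to confirm nothing changed). Both return exactly what `autoregressive` would.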

This is not unlike branch-and-bound: its worst-case runtime is worse than a simple brute-force search, yet in practice it solves integer linear programs in close to polynomial time, because most people are not encoding the hardest instances of NP problems as integer linear programs.

The short answer is that we do more than one parallel pass over multiple tokens: we iteratively refine them over a few passes to fix any incoherence. This can be seen as a generalization of the diffusion algorithms that underlie systems like Midjourney or Sora.

So if I understand correctly, you remask some tokens that were previously unmasked?
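The thread does not say whether this particular system remasks tokens. For reference, one well-known iterative parallel refinement scheme that does is Mask-Predict: predict every masked slot in one pass, then remask the lowest-confidence positions (including ones filled in earlier passes) on a shrinking linear schedule. The sketch below assumes a hypothetical toy denoiser, `toy_predict`, whose target is the sequence 1, 2, ..., n; it is an illustration of the general idea, not the method described above.

```python
from typing import Callable, List, Optional, Tuple

Seq = List[Optional[int]]  # None marks a [MASK] position
Predictor = Callable[[Seq], List[Tuple[int, float]]]


def toy_predict(tokens: Seq) -> List[Tuple[int, float]]:
    """Hypothetical denoiser: predicts (token, confidence) for every position
    in one parallel pass over the current, partially masked input."""
    n = len(tokens)
    preds = []
    for i in range(n):
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < n else None
        if i == 0:
            preds.append((1, 0.95))         # anchored start
        elif left is not None:
            preds.append((left + 1, 0.9))   # extend from the left neighbour
        elif right is not None:
            preds.append((right - 1, 0.8))  # back off from the right neighbour
        else:
            preds.append((0, 0.1))          # no context: low-confidence guess
    return preds


def mask_predict(predict: Predictor, n: int, steps: int) -> Seq:
    if steps < 1:
        raise ValueError("steps must be at least 1")
    tokens: Seq = [None] * n
    conf = [0.0] * n
    for t in range(steps):
        preds = predict(tokens)  # one parallel pass over all positions
        for i, (tok, c) in enumerate(preds):
            if tokens[i] is None:  # only fill currently masked slots
                tokens[i], conf[i] = tok, c
        num_mask = n * (steps - 1 - t) // steps  # linear masking schedule
        if num_mask == 0:
            break
        # Remask the least confident positions, even ones unmasked in earlier passes.
        for i in sorted(range(n), key=lambda j: conf[j])[:num_mask]:
            tokens[i] = None
    return tokens


if __name__ == "__main__":
    print(mask_predict(toy_predict, 6, 6))  # [1, 2, 3, 4, 5, 6]
    print(mask_predict(toy_predict, 6, 3))  # [1, 2, 3, -2, -1, 0]
```

In this toy, low-confidence guesses from the first pass are remasked and re-predicted once more context is available. With as many passes as tokens it recovers the coherent sequence; with only 3 passes it commits to a right-anchored guess too early and leaves an incoherent tail, which is the trade-off between pass count and quality that the comment above alludes to.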