| HN Mirror

Assuming the model tracks convergence in one way or another, it would simply keep iterating until the error falls below some epsilon.

This means that in the worst case the number of iterations is the same as for a classic autoregressive transformer.

So they are mostly exploiting the fact that the average response is, in reality, not fully sequential: the model discovers the exploitable parallelism on its own.
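To make the iteration-count argument concrete, here is a minimal toy sketch in Python. It is not the model being discussed: it assumes a deterministic `next_token` function and a Jacobi-style fixed-point refinement, with "error below epsilon" simplified to "no position changed". The names `jacobi_decode`, `chain`, `blocks` and `independent` are made up for illustration.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def autoregressive_decode(next_token: NextToken, n: int) -> Tuple[List[int], int]:
    """Baseline: one sequential step per token, so always n steps."""
    tokens: List[int] = []
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens, n


def jacobi_decode(next_token: NextToken, n: int, pad: int = 0) -> Tuple[List[int], int]:
    """Refine all n positions at once until nothing changes.

    Returns (tokens, passes). The pass count includes the final pass that
    confirms convergence, and is capped at n because after pass k the first
    k positions are guaranteed to be correct.
    """
    tokens = [pad] * n
    for passes in range(1, n + 1):
        # Each position only reads the previous guess of its prefix, so these
        # n calls are independent and could be evaluated in parallel.
        updated = [next_token(tokens[:i]) for i in range(n)]
        if updated == tokens:  # toy stand-in for "error below epsilon"
            return tokens, passes
        tokens = updated
    return tokens, n


# Hypothetical toy "models" with different amounts of sequential dependency.
def chain(prefix: List[int]) -> int:
    """Every token depends on the one before it (fully sequential)."""
    return 1 if not prefix else prefix[-1] + 1


def blocks(prefix: List[int]) -> int:
    """Dependencies only reach back within blocks of four tokens."""
    return 1 if len(prefix) % 4 == 0 else prefix[-1] + 1


def independent(prefix: List[int]) -> int:
    """Each token depends only on its position."""
    return 2 * len(prefix) + 1


if __name__ == "__main__":
    for name, model in [("chain", chain), ("blocks", blocks), ("independent", independent)]:
        reference, steps = autoregressive_decode(model, 8)
        tokens, passes = jacobi_decode(model, 8)
        assert tokens == reference
        print(f"{name}: {passes} parallel passes vs {steps} sequential steps")
```

With 8 tokens, the fully sequential `chain` needs all 8 passes, the same as the autoregressive baseline, while `blocks` converges in 5 and `independent` in 2 (both counts include the confirming pass). The refinement loop never has to be told where the dependencies are; it just stops earlier when there are fewer of them.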

This is not too dissimilar to a branch-and-bound algorithm, which has a worse theoretical runtime than a simple brute-force search but in practice solves integer linear programs in almost polynomial time, because not everyone is encoding the hardest instances of problems in NP as integer linear programs.
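The branch-and-bound comparison can be sketched the same way. The Python below uses 0/1 knapsack as a small stand-in for an integer linear program and a fractional (LP-relaxation) upper bound for pruning; the function names and the work counters are illustrative assumptions, not taken from any particular solver.

```python
from itertools import product
from typing import List, Sequence, Tuple


def brute_force(values: Sequence[int], weights: Sequence[int], capacity: int) -> Tuple[int, int]:
    """Try every 0/1 assignment. Returns (best value, assignments checked)."""
    best, checked = 0, 0
    for choice in product((0, 1), repeat=len(values)):
        checked += 1
        weight = sum(w for w, c in zip(weights, choice) if c)
        if weight <= capacity:
            best = max(best, sum(v for v, c in zip(values, choice) if c))
    return best, checked


def branch_and_bound(values: Sequence[int], weights: Sequence[int], capacity: int) -> Tuple[int, int]:
    """Depth-first branch and bound. Returns (best value, tree nodes visited).

    Assumes all weights are positive.
    """
    # Sort by value density so the greedy fill below is the LP optimum.
    items: List[Tuple[int, int]] = sorted(
        zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True
    )
    best, nodes = 0, 0

    def bound(i: int, value: float, room: int) -> float:
        """Upper bound: fill greedily, allowing one fractional item."""
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    def visit(i: int, value: int, room: int) -> None:
        nonlocal best, nodes
        nodes += 1
        best = max(best, value)  # every partial selection is feasible
        if i == len(items) or bound(i, value, room) <= best:
            return  # leaf, or this subtree cannot beat the incumbent
        v, w = items[i]
        if w <= room:
            visit(i + 1, value + v, room - w)  # take item i
        visit(i + 1, value, room)  # skip item i

    visit(0, 0, capacity)
    return best, nodes


if __name__ == "__main__":
    values = [60, 100, 120, 30, 70, 45, 90, 20, 55, 80, 35, 65]
    weights = [10, 20, 30, 5, 15, 9, 25, 4, 12, 18, 8, 14]
    print("brute force:", brute_force(values, weights, 60))
    print("branch and bound:", branch_and_bound(values, weights, 60))
```

The search tree has up to 2^(n+1) - 1 nodes, more than the 2^n assignments brute force enumerates, and each node also pays for a bound computation, so the worst case really is no better. How much gets pruned depends entirely on the instance, which is the point of the analogy.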