by rhaps0dy 916 days ago
Its latent space transition is linear rather than nonlinear, so there is a more parallelizable algorithm for advancing it through time. This makes it much more efficient to train and to run inference with on GPUs.

It keeps all the representational power of LSTMs by letting the transition vary with the input, while still being linear in the latent state.
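
To make the parallelism point concrete, here is a rough sketch assuming a scalar (diagonal) recurrence h_t = a_t * h_{t-1} + b_t, where a_t and b_t stand in for whatever the model computes from the input at step t. The names `combine`, `sequential_scan` and `doubling_scan` are made up for illustration and not taken from any library or paper. Because each step is an affine map of the state, and composing affine maps is associative, the whole sequence can be evaluated as a prefix scan in about log2(T) rounds instead of T dependent steps.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b, h0=0.0):
    """Reference implementation: T dependent steps, one after another."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def doubling_scan(a, b, h0=0.0):
    """Inclusive prefix scan by repeated doubling (Hillis-Steele style).

    Each round only reads the previous round's values, so on a GPU all
    positions within a round could be updated at the same time. Here the
    rounds are simulated with a plain list comprehension.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # elems[t] is now the composed map for steps 0..t; apply it to h0.
    return [a_t * h0 + b_t for a_t, b_t in elems]
```

The same trick does not work for a nonlinear transition such as an LSTM's, since there is no compact way to compose the per-step maps. A real implementation would use matrices or per-channel vectors rather than scalars, and a work-efficient scan rather than this simple doubling scheme.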

1 comment

Thanks, that's helpful. One place where this method's parallelizability falls short of the transformer's is that you can't pack multiple varying-length examples into the same array during training using a block-diagonal attention pattern. If I understand correctly, that's not possible with this architecture, and it's an important practical concern in large-scale transformer training.
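
For anyone unfamiliar with the packing trick on the transformer side, a minimal sketch of the mask is below. The function name `packed_attention_mask` and the `causal` flag are hypothetical, chosen just for this example. Several examples are concatenated into one row, and a query position may attend to a key position only if both belong to the same example, which gives a block-diagonal pattern.

```python
def packed_attention_mask(lengths, causal=True):
    """Boolean mask for several sequences packed into one row.

    mask[q][k] is True when query position q may attend to key position k.
    Positions from different examples never see each other, so the mask is
    block diagonal, with one block per packed example.
    """
    # Segment id per packed position, e.g. lengths [2, 3] -> [0, 0, 1, 1, 1].
    segment = [s for s, n in enumerate(lengths) for _ in range(n)]
    total = len(segment)
    return [
        [segment[q] == segment[k] and (k <= q or not causal) for k in range(total)]
        for q in range(total)
    ]


# Two examples of lengths 2 and 3 packed into a single row of 5 tokens.
for row in packed_attention_mask([2, 3]):
    print("".join("x" if allowed else "." for allowed in row))
```

With a mask like this, no padding is needed to batch examples of different lengths, which is the practical advantage being described.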