Hacker News new | ask | show | jobs
by jdeaton 916 days ago
Thanks thats helpful. One place where the parallelizability of this method falls short of the transformer is not being able to pack multiple varying length examples into the same array during training with block diagonal attention pattern. If I understand correctly thats not possible with this architecture and its an important practical concern in large scale transformer training.