by jdeaton
916 days ago
Thanks, that's helpful. One place where the parallelizability of this method falls short of the transformer is that you can't pack multiple varying-length examples into the same array during training using a block-diagonal attention pattern. If I understand correctly, that's not possible with this architecture, and it's an important practical concern in large-scale transformer training.
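To make concrete what I mean by packing on the transformer side, here is a minimal pure-Python sketch. The function name `packed_attention_mask` and the toy lengths are my own for illustration and not taken from any library. Several examples are concatenated into one row, and a token may attend only to tokens from its own example (and, in the causal case, only to earlier positions).

```python
def packed_attention_mask(lengths, causal=True):
    """Build a block-diagonal attention mask for examples packed into one row.

    lengths: token count of each example, in packing order (hypothetical input).
    Returns (segment_ids, mask), where mask[q][k] is True if query position q
    is allowed to attend to key position k.
    """
    # segment_ids[i] is the index of the example that token i belongs to.
    segment_ids = [seg for seg, n in enumerate(lengths) for _ in range(n)]
    total = len(segment_ids)
    mask = [
        [
            # Same example only; with causal=True, also no looking ahead.
            segment_ids[q] == segment_ids[k] and (not causal or k <= q)
            for k in range(total)
        ]
        for q in range(total)
    ]
    return segment_ids, mask


# Two examples of length 3 and 2 packed into a single row of 5 tokens.
segment_ids, mask = packed_attention_mask([3, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each example gets its own block on the diagonal, so no padding is needed and no attention leaks across example boundaries. Real training code would typically build the same mask as a tensor from the segment ids rather than as nested lists.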