by strin 2483 days ago
> In our experiments with Transformers, we observed that not all the attention heads utilize their attention span to the fullest. In fact, in a task of character-level language modeling, most of the heads were using only a small portion of their attention span. If we can take advantage of this property during training, we can reduce the computation time and memory footprint significantly.
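To make the idea in the quote concrete, here is a minimal sketch of one way a head could be limited to a short span: a soft mask that is 1 for nearby tokens and ramps linearly down to 0 beyond a per-head span. This is an illustration under my own assumptions, not the quoted authors' implementation; the names `soft_span_mask`, `masked_attention`, `span`, and `ramp` are hypothetical, and `scores[d]` is assumed to be the raw attention score for the token `d` positions back.

```python
import math


def soft_span_mask(distance, span, ramp):
    """Return a weight in [0, 1]: 1 up to `span`, then a linear ramp to 0.

    Assumes span >= 0 and ramp > 0 (hypothetical parameters for this sketch).
    """
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def masked_attention(scores, span, ramp):
    """Softmax over past positions, reweighted by the soft span mask.

    scores[d] is the raw score for the token d positions back (d = 0 is
    the current token). Positions beyond span + ramp get exactly zero weight.
    """
    peak = max(scores)  # subtract the max for numerical stability
    weights = [
        soft_span_mask(d, span, ramp) * math.exp(s - peak)
        for d, s in enumerate(scores)
    ]
    total = sum(weights)  # > 0 because the mask is 1 at distance 0
    return [w / total for w in weights]


if __name__ == "__main__":
    # A head with a short span ignores everything past span + ramp.
    attn = masked_attention([0.0] * 8, span=4, ramp=2)
    print([round(a, 3) for a in attn])
```

Since every position at distance `span + ramp` or more gets zero weight, a head with a small span never needs the keys and values for those positions, which is where the compute and memory savings would come from.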