by deeplstm 1944 days ago
Modelling long sequences has always been hard for transformer-based models. This paper proposes a new way for the transformer to cache previously processed tokens, and it makes generation 9x faster. Truly mind-blowing.
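For anyone wondering why caching speeds up generation, here is a rough sketch of plain key/value caching in a single attention head. This is generic caching, not the paper's specific method (read the paper for that), and every name in it (`CachedAttention`, `recompute`, `project`, `attend`) is made up for illustration.

```python
import math


def project(x, matrix):
    """Compute x @ matrix, where matrix is a list of rows."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


class CachedAttention:
    """Hypothetical single-head layer that stores past keys and values."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys, self.values = [], []

    def step(self, x):
        # Only the newest token is projected; older keys/values are reused.
        self.keys.append(project(x, self.w_k))
        self.values.append(project(x, self.w_v))
        return attend(project(x, self.w_q), self.keys, self.values)


def recompute(tokens, w_q, w_k, w_v):
    """Baseline without a cache: re-project every token at each step."""
    keys = [project(t, w_k) for t in tokens]
    values = [project(t, w_v) for t in tokens]
    return attend(project(tokens[-1], w_q), keys, values)
```

Calling `step` once per generated token gives the same output as `recompute` on the whole prefix, but each step does one new key/value projection instead of one per previous token.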

Paper https://arxiv.org/abs/2012.15832

Code https://github.com/ofirpress/shortformer