Hacker News
by deeplstm, 1944 days ago
Modeling long sequences has always been hard for transformer-based models. This paper proposes a new way for the transformer to cache previously processed tokens, and it reports that this makes generation 9x faster. That is an impressive speedup for such a simple change.
Paper: https://arxiv.org/abs/2012.15832
Code: https://github.com/ofirpress/shortformer
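
For anyone wondering what "caching previously processed tokens" means in general, here is a minimal sketch of generic key/value caching in single-head attention. It is only an illustration under my own assumptions and does not attempt to reproduce the paper's method or the code in the linked repo. The names `attend`, `KVCache`, and `full_recompute` are hypothetical, and the toy vectors are made up.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    """Hypothetical helper: stores keys/values of tokens already processed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Append the new token once, then attend over everything cached so far.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

def full_recompute(queries, keys, values):
    """Baseline: rebuild the causal prefix from scratch at every position."""
    return [attend(queries[t], keys[: t + 1], values[: t + 1])
            for t in range(len(queries))]

# Toy, hand-written vectors (made-up data; q = k = v here just for brevity).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_out = [cache.step(t, t, t) for t in tokens]
baseline_out = full_recompute(tokens, tokens, tokens)
```

Both paths give the same outputs. The difference is that with the cache each new token only adds one key and one value instead of reprocessing the whole prefix, which is where generation-time savings come from in general.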