Hacker News
by voxgen 1183 days ago
One can also have the model learn the context length that each layer and head actually needs, which saves a large amount of FLOPs:
https://arxiv.org/abs/1905.07799
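
For anyone who wants the gist without opening the paper, here is a rough sketch of the idea as I understand it: each head gets a learnable span, and a soft mask ramps attention weights down to zero beyond that span, so keys past span + ramp never need to be computed. The function names, the `ramp` parameter, and the pure-Python list representation are my own assumptions for illustration and not the paper's code; `span` is treated as a fixed number here, whereas in training it would be a learned parameter, and it is assumed non-negative.

```python
import math

def soft_span_mask(distance, span, ramp):
    """Soft mask in [0, 1]: 1 within `span`, then a linear ramp down to 0
    over the next `ramp` positions (hypothetical helper, not the paper's code)."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)

def masked_attention(scores, span, ramp):
    """Normalize attention for one query and one head.

    scores[d] is the raw score for the key `d` positions back.
    Each exponentiated score is scaled by the soft mask before normalizing,
    so keys beyond span + ramp get exactly zero weight.
    """
    weights = [math.exp(s) * soft_span_mask(d, span, ramp)
               for d, s in enumerate(scores)]
    total = sum(weights)  # positive as long as span + ramp > 0
    return [w / total for w in weights]

def effective_context(span, ramp):
    """Count of past positions (distance 0, 1, 2, ...) with a nonzero mask,
    i.e. how many keys this head actually has to look at."""
    return max(0, math.ceil(span + ramp))
```

A head with a small learned span only needs `effective_context(span, ramp)` keys instead of the full context, which is where the FLOP savings would come from.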