by voxgen 1183 days ago
One can also have the model learn the context length each layer and head actually needs, which saves a huge amount of FLOPs: https://arxiv.org/abs/1905.07799
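
A rough sketch of how I understand the idea, not the paper's actual code: each head gets a learnable span, attention weights are multiplied by a soft mask that ramps down to zero past that span, and keys beyond the cutoff can be skipped entirely. The function names, the `ramp` parameter, and the plain-Python list representation are my own assumptions for illustration.

```python
import math

def soft_span_mask(distance, span, ramp):
    """1.0 inside the span, then a linear decay to 0.0 over `ramp` positions."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)

def masked_attention(scores, span, ramp):
    """scores[d] is the raw attention score for the key d positions back.

    Returns softmax weights multiplied by the soft mask and re-normalized.
    """
    top = max(scores)  # subtract the max for numerical stability
    weights = [
        soft_span_mask(d, span, ramp) * math.exp(s - top)
        for d, s in enumerate(scores)
    ]
    total = sum(weights)
    return [w / total for w in weights]

def effective_context(span, ramp):
    """Number of past positions with a nonzero mask; the rest need no compute."""
    return math.ceil(span + ramp)

# Hypothetical head with a learned span of 4 and a ramp of 2 positions.
weights = masked_attention([0.0] * 8, span=4, ramp=2)
print(weights)
print(effective_context(4, 2))
```

In training, `span` would be a per-head parameter with a penalty on its size, so heads that only need local context shrink their span and the FLOPs savings come from never computing scores past `effective_context`.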