by thegeomaster
1245 days ago
There's already research that tries to fix this problem with transformers in general, like Transformer-XL [1]. I'm a bit puzzled that I don't see much interest in getting a pre-trained model out that uses this architecture; it seems to give good results.

[1]: https://arxiv.org/abs/1901.02860