|
|
|
|
|
by safarimonkey
1098 days ago
|
|
Some of the techniques improve over linear scaling of the baseline models. For example, from the article: > Conditional computation avoids applying all model parameters to all tokens from the input sequence. [CoLT5] applies heavy computations only to the most important tokens and processes the rest of the tokens with a lighter version of layers. It will speed up both training and inference. [CoLT5]: https://arxiv.org/abs/2303.094752 |
|