by oofbey
1565 days ago
Why would evaluation be faster with narrower layers? With fewer tokens it would definitely be faster, because self-attention cost scales with tokens^2, but here "narrow" refers to the number of channels, presumably for the same number of tokens.
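
For reference, a rough back-of-envelope sketch of the scaling in question. It assumes a standard transformer layer (four d x d attention projections and an MLP with 4x expansion) and counts one multiply-add as 2 FLOPs; `layer_flops` and its parameters are hypothetical names for illustration, not from any library or from the model being discussed.

```python
def layer_flops(n_tokens: int, d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate forward-pass FLOPs for one transformer layer.

    Assumptions (not taken from the thread): standard self-attention with
    Q, K, V and output projections of size d_model x d_model, plus a
    two-layer MLP with hidden size mlp_ratio * d_model. Softmax, layer norm
    and biases are ignored.
    """
    # Q, K, V and output projections: 4 matmuls of (n x d) @ (d x d)
    proj = 4 * 2 * n_tokens * d_model * d_model
    # Attention scores (Q @ K^T) and weighted sum (A @ V): quadratic in tokens
    attn = 2 * 2 * n_tokens * n_tokens * d_model
    # MLP: (n x d) @ (d x 4d), then (n x 4d) @ (4d x d)
    mlp = 2 * 2 * n_tokens * d_model * (mlp_ratio * d_model)
    return proj + attn + mlp


if __name__ == "__main__":
    n = 1024
    for d in (1024, 512):
        print(f"tokens={n} channels={d} flops={layer_flops(n, d):,}")
```

Under these assumptions only the attention-score term grows as tokens^2; the projection and MLP terms grow as channels^2, so the channel count does enter the estimate even when the number of tokens is fixed.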