by apstroll
345 days ago
It's extremely doubtful that it boils down to the quadratic scaling of attention. That whole concern is a leftover from the days of small BERT models with very few parameters. For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops

Compute per token = 2(P + L × W × D)

P: total parameters
L: number of layers
W: context size
D: embedding dimension

For Llama 8B, the context-dependent term L × W × D overtakes the parameter term P, so that the window size starts dominating compute cost per token, only at around 61k tokens.
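To make the 61k figure concrete, here is a rough sketch of the arithmetic. The crossover is where L × W × D equals P, i.e. W = P / (L × D). The Llama numbers below (about 8.03e9 parameters, 32 layers, embedding dimension 4096) are my assumptions for Llama 3 8B and are not stated in the comment above; the function names are made up for illustration.

```python
def flops_per_token(params, layers, context, d_model):
    """Compute per token = 2(P + L * W * D), per the formula above."""
    return 2 * (params + layers * context * d_model)


def crossover_context(params, layers, d_model):
    """Context size W at which L * W * D equals P."""
    return params / (layers * d_model)


# Assumed values for Llama 3 8B (not given in the original comment).
P = 8.03e9  # total parameters
L = 32      # number of layers
D = 4096    # embedding dimension

w = crossover_context(P, L, D)
print(f"context term matches parameter term at about {w:,.0f} tokens")
```

Below that context length the 2P term is the larger share of per-token compute; at the crossover the total is 4P, split evenly between the two terms.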