by apstroll 392 days ago

Extremely doubtful that it boils down to the quadratic scaling of attention. That whole issue is a leftover from the days of small BERT models with very few parameters.

For large models, compute is very rarely dominated by attention. Take, for example, this FLOPs calculation from https://www.adamcasson.com/posts/transformer-flops:

Compute per token = 2(P + L × W × D)

P: total parameters
L: number of layers
W: context size
D: embedding dimension

For Llama 8B, the window size starts dominating compute cost per token only at around 61k tokens, i.e. the point where L × W × D grows past P.
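
For anyone who wants to check the arithmetic, here is a rough sketch of the formula above. The Llama 8B numbers are my assumptions, not something stated in the linked post: P ≈ 8e9, L = 32, D = 4096. The function names are made up for illustration.

```python
def flops_per_token(n_params, n_layers, context, d_model):
    """Compute per token = 2(P + L * W * D), as in the formula above."""
    return 2 * (n_params + n_layers * context * d_model)


def attention_share(n_params, n_layers, context, d_model):
    """Fraction of per-token compute that comes from the L * W * D term."""
    attn = n_layers * context * d_model
    return attn / (n_params + attn)


def crossover_context(n_params, n_layers, d_model):
    """Context size W at which L * W * D equals P."""
    return n_params / (n_layers * d_model)


# Assumed Llama 8B shape (hypothetical round numbers, see note above).
P, L, D = 8e9, 32, 4096

w = crossover_context(P, L, D)
print(f"attention term matches parameter term at W ~ {w:,.0f} tokens")

for ctx in (2048, 8192, 32768, 131072):
    share = attention_share(P, L, ctx, D)
    print(f"W = {ctx:>7}: attention term is {share:.1%} of per-token compute")
```

With those assumed values the crossover lands at roughly 61k tokens, and below that the parameter term is the larger share of per-token compute.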