Hacker News new | ask | show | jobs
by qeternity 385 days ago
Attention is not the primary inference bottleneck. For each token you have to load all of the weights (or activated weights) from memory. This is why Cerebras is fast: they have huge memory bandwidth.