Y
Hacker News
new
|
ask
|
show
|
jobs
by
qeternity
385 days ago
Attention is not the primary inference bottleneck. For each token you have to load all of the weights (or activated weights) from memory. This is why Cerebras is fast: they have huge memory bandwidth.