Hacker News
by elcritch 111 days ago
The attention layers in LLMs/transformers require an O(n^2) operation across all pairs of tokens, which does require significant bandwidth.
Yes, the latency hurts performance; that's why it's only achieving ~8 tok/s.
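To make the O(n^2) point concrete, here is a rough sketch of how the attention score matrix grows with context length. The function names, the single-head default, and the 2-byte (fp16) entry size are assumptions for illustration only, not details of the setup being discussed.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Query-key score entries for full self-attention over n_tokens tokens."""
    # Every token attends to every token, so the score matrix is n x n.
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, n_heads: int = 1, bytes_per_entry: int = 2) -> int:
    """Approximate size of the score matrices for one layer.

    n_heads and bytes_per_entry are hypothetical defaults (1 head, fp16).
    """
    return attention_score_entries(n_tokens) * n_heads * bytes_per_entry


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        print(n, attention_score_entries(n), score_matrix_bytes(n))
```

Doubling the context length quadruples the number of score entries, which is the quadratic growth in question.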