by elcritch 111 days ago
Transformer attention layers in LLMs perform an O(n^2) operation over all pairs of tokens, which does require significant bandwidth.
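
To put rough numbers on the O(n^2) part, here's a minimal sketch that counts the query-key scores in one full self-attention pass for a single head. The function name and the 2-bytes-per-score (fp16) figure are my own assumptions for illustration. It ignores KV caching during decoding and fused kernels that never materialize the full score matrix.

```python
def attention_score_count(n_tokens: int) -> int:
    """Query-key scores in one full self-attention pass, per head."""
    # Every token attends to every token, so the score matrix is n x n.
    return n_tokens * n_tokens


BYTES_PER_SCORE = 2  # assumption: fp16 scores

for n in (1_000, 2_000, 4_000):
    scores = attention_score_count(n)
    megabytes = scores * BYTES_PER_SCORE / 1e6
    print(f"{n} tokens: {scores} scores, about {megabytes:.0f} MB per head")
```

Doubling the context length quadruples the number of scores, which is where the scaling starts to bite.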

Yes, the latency hurts performance; that's why it's only achieving ~8 tok/s.