Hacker News
by
yorwba
502 days ago
The performance advantage comes from doing 1/32 of the floating point operations compared to a dense layer with the same number of parameters.
1 comment
iamnotagenius
502 days ago
The performance gain comes mostly from needing only a fraction of the memory bandwidth, since LLMs are mostly memory-constrained. Compute matters too, but usually far less than memory.
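A rough back-of-envelope sketch of the two views in this thread, not a measurement. It assumes batch size 1, 16-bit weights, and that the sparse layer reads only the 1/32 of the parameters it actually uses. The hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`) and the layer size (`PARAMS`) are hypothetical round numbers chosen for illustration.

```python
# Roofline-style estimate: a layer's time is bounded by whichever is slower,
# doing the arithmetic or streaming the weights from memory.

def layer_times(params_touched, bytes_per_param, peak_flops, mem_bandwidth):
    """Return (compute_seconds, memory_seconds) for one token at batch size 1."""
    flops = 2 * params_touched                 # one multiply and one add per weight
    bytes_read = params_touched * bytes_per_param
    return flops / peak_flops, bytes_read / mem_bandwidth

# Hypothetical numbers, for illustration only.
PARAMS = 4096 * 4096 * 32      # parameters in the layer
BYTES_PER_PARAM = 2            # 16-bit weights (assumption)
PEAK_FLOPS = 100e12            # 100 TFLOP/s (assumption)
MEM_BANDWIDTH = 1e12           # 1 TB/s (assumption)

# Dense layer: every parameter is read and used.
dense_compute, dense_memory = layer_times(
    PARAMS, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)

# Sparse layer: same parameter count, but only 1/32 is touched per token.
sparse_compute, sparse_memory = layer_times(
    PARAMS // 32, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH)

print(f"dense : compute {dense_compute:.2e} s, memory {dense_memory:.2e} s")
print(f"sparse: compute {sparse_compute:.2e} s, memory {sparse_memory:.2e} s")
print("dense layer is memory-bound:", dense_memory > dense_compute)
```

Under these assumptions both the FLOP count and the bytes read shrink by the same factor of 32, but the memory term is the larger of the two, so it is the one that sets the wall-clock time. With a large batch the weights are reused across tokens and the balance can shift toward compute.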