Hacker News
by yorwba 502 days ago
The performance advantage comes from doing 1/32 of the floating point operations compared to a dense layer with the same number of parameters.
1 comment

The performance comes mostly from needing only a fraction of the memory bandwidth, as LLMs are mostly memory-bound. Compute matters too, but usually far less than memory.
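
A rough back-of-the-envelope sketch of that point, in Python. Every number here is a made-up assumption for illustration (a hypothetical accelerator with 100 TFLOP/s and 1 TB/s, a 1B-parameter layer in 16-bit weights, batch size 1), and it also assumes that a layer doing 1/32 of the FLOPs only reads 1/32 of the weights per token, which the thread does not state. `layer_time` is a hypothetical helper, not anything from a real library.

```python
def layer_time(flops, bytes_read, peak_flops, mem_bandwidth):
    """Simple roofline-style estimate: the layer takes as long as the
    slower of its compute time and its weight-loading time."""
    compute_time = flops / peak_flops
    memory_time = bytes_read / mem_bandwidth
    return max(compute_time, memory_time), compute_time, memory_time


# Hypothetical hardware and model numbers (assumptions, not measurements).
PEAK_FLOPS = 100e12      # 100 TFLOP/s
MEM_BANDWIDTH = 1e12     # 1 TB/s
PARAMS = 1e9             # parameters in the layer
BYTES_PER_PARAM = 2      # 16-bit weights
FRACTION = 1 / 32        # share of weights touched per token (assumed)

# Dense layer at batch size 1: one multiply and one add per weight,
# and every weight is read from memory once.
dense_total, dense_compute, dense_memory = layer_time(
    2 * PARAMS, PARAMS * BYTES_PER_PARAM, PEAK_FLOPS, MEM_BANDWIDTH
)

# Sparse layer with the same parameter count, touching 1/32 of the weights.
sparse_total, sparse_compute, sparse_memory = layer_time(
    2 * PARAMS * FRACTION,
    PARAMS * BYTES_PER_PARAM * FRACTION,
    PEAK_FLOPS,
    MEM_BANDWIDTH,
)

print(f"dense:  compute {dense_compute:.2e} s, memory {dense_memory:.2e} s")
print(f"sparse: compute {sparse_compute:.2e} s, memory {sparse_memory:.2e} s")
print(f"speedup: {dense_total / sparse_total:.1f}x")
```

Under these assumptions the memory term is larger than the compute term in both cases, so the estimated speedup tracks the bytes read rather than the FLOPs saved. With larger batches the weights are reused across inputs and the balance shifts toward compute.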