by londons_explore 456 days ago
RMSNorm acts like a barrier. No compute on the next network layer can start before all compute in the previous layer is done.

When a network is split across multiple GPUs, this means you must wait for the slowest node and the longest latency.
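
To make the barrier concrete, here is a minimal sketch in plain Python, assuming the hidden vector is sharded across devices (the shards are just list slices standing in for GPUs, and the learned gain is omitted). The function names are made up for illustration. Each shard can compute its own sum of squares, but nobody can normalize until every partial sum has arrived.

```python
import math

def rmsnorm(x, eps=1e-6):
    """Reference RMSNorm over the full hidden vector (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def sharded_rmsnorm(shards, eps=1e-6):
    """RMSNorm with the hidden vector split across hypothetical devices."""
    # Step 1: purely local work, each shard can run this in parallel.
    partial_sums = [sum(v * v for v in shard) for shard in shards]
    # Step 2: the barrier. This sum stands in for an all-reduce; it cannot
    # finish until the slowest shard has reported its partial sum.
    total = sum(partial_sums)
    n = sum(len(shard) for shard in shards)
    rms = math.sqrt(total / n + eps)
    # Step 3: only now can any shard produce its normalized output.
    return [[v / rms for v in shard] for shard in shards]

x = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0]
shards = [x[0:2], x[2:4], x[4:6]]  # three pretend devices

full = rmsnorm(x)
split = [v for shard in sharded_rmsnorm(shards) for v in shard]
assert all(math.isclose(a, b) for a, b in zip(full, split))
```

The sharded version matches the reference, but step 2 is a synchronization point that every shard has to wait on.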

Once you can remove most of these barriers, compute over networks without latency guarantees becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).

1 comments

What are other barriers in transformers? Or is the normalization layer the primary one?
Dot-product attention is the biggest barrier. This is why there are so many attempts to linearize it.
...that fail. Linearization is a bad idea, but plenty of other optimizations are done.
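
For anyone wondering why dot-product attention counts as a barrier, a rough sketch in plain Python, assuming a single query, scalar values, and keys split across pretend devices (function names are hypothetical, not from any library). The softmax denominator depends on every key, so partial results can be computed locally but the output is only final after a merge over all shards. The merge shown is the usual running-max rescaling trick, used here as an illustration rather than as anyone's specific implementation.

```python
import math

def attention_full(scores, values):
    """Softmax-weighted average over all keys for one query."""
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def attention_sharded(score_shards, value_shards):
    """Same result, with keys and values split across hypothetical devices."""
    partials = []
    for scores, values in zip(score_shards, value_shards):
        # Local work: each shard only needs its own keys and values.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        partials.append((m, sum(exps), sum(e * v for e, v in zip(exps, values))))
    # The barrier: the global max and denominator need every shard's partials.
    global_max = max(m for m, _, _ in partials)
    denom = sum(l * math.exp(m - global_max) for m, l, _ in partials)
    numer = sum(o * math.exp(m - global_max) for m, _, o in partials)
    return numer / denom

scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
full = attention_full(scores, values)
split = attention_sharded([scores[0:3], scores[3:6]], [values[0:3], values[3:6]])
assert math.isclose(full, split)
```

The local loop parallelizes fine; it is the final merge that forces every shard to wait for the others.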