HN Mirror


by londons_explore 503 days ago

RMSNorm acts like a barrier: no compute on the next network layer can start before all compute in the previous layer is done.

When a network is split across multiple GPUs, this means you must wait for the slowest node and the longest latency.

As soon as you can remove most of these barriers, compute over non-latency-guaranteed networks becomes more practical, as does non-homogeneous compute (i.e. mixing different GPU models).
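
To illustrate the barrier described above, here is a minimal pure-Python sketch of RMSNorm, assuming a single activation vector hypothetically split across two devices. The names `rms_norm`, `shard_a`, and `shard_b` are made up for the example and do not come from any particular framework. The point is that the scale factor is a reduction over every element, so no shard can be normalized correctly until all shards have reported in.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """RMSNorm over one activation vector (illustrative sketch only)."""
    if gain is None:
        gain = [1.0] * len(x)
    # Reduction step: the mean of squares needs every element of x,
    # so it cannot finish until all producers of x are done.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]

# Hypothetical split of one activation vector across two devices.
shard_a = [1.0, 2.0]
shard_b = [3.0, 4.0]

# Correct result: uses the statistics of the full vector.
full = rms_norm(shard_a + shard_b)

# Normalizing a shard with only its local statistics gives different values,
# which is why each device has to wait for the others.
local = rms_norm(shard_a)

# Each device can compute its partial sum of squares independently;
# the synchronization point is combining those partial sums.
partial_sums = [sum(v * v for v in s) for s in (shard_a, shard_b)]
global_mean_sq = sum(partial_sums) / (len(shard_a) + len(shard_b))
```

Only one scalar per shard has to cross the network here, but every device still blocks until the slowest one has delivered it.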

1 comments

elcritch 503 days ago

What are other barriers in transformers? Or is the normalization layer the primary one?


woadwarrior01 503 days ago

Dot-product attention is the biggest barrier. This is why there are so many attempts to linearize it.
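
One way to read this claim, sketched in pure Python for a single query: the softmax normalizer sums over the scores of every key, so no attention weight is final until all keys have been scored. The function name `attention_row` and the toy inputs are assumptions made for illustration, not code from any library.

```python
import math

def attention_row(q, keys, values):
    """Scaled dot-product attention for one query vector (illustrative sketch)."""
    d = len(q)
    # One score per key: dot(q, k) / sqrt(d).
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Softmax: both the max (for numerical stability) and the denominator
    # are reductions over ALL keys, so every score must be available first.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    weights = [e / denom for e in exps]
    # Weighted sum of the value vectors.
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# Toy inputs: two keys, two values.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = attention_row(q, keys, values)
```

Linearized variants try to replace the softmax so that this per-row normalization no longer couples all keys together.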


amitport 502 days ago

...that fail. Linearization is a bad idea, but plenty of other optimizations are done.
