by Kubuxu 514 days ago
I don't think an entire GPU is specialised, nor that a single token will always use the same expert. I think of it as a scatter-gather operation at each layer.

Let's say you have an inference batch of 128 chats. At layer `i` you take the hidden states, compute their routing, and scatter them, along with the KV for that layer, among the GPUs (each one handling different experts). The attention and FF happen on those GPUs (as the model params are there), and the results get gathered again.
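To make the scatter-gather idea concrete, here is a rough single-process sketch in plain Python. Everything in it is an assumption for illustration: top-1 routing, four experts standing in for four GPUs, a toy `expert_ff` in place of real FF weights, and the names `route`, `expert_ff` and `moe_layer` are made up. It skips attention and the KV entirely and only shows the route, scatter, compute, gather pattern.

```python
import random

NUM_EXPERTS = 4   # assumption: one expert per simulated "GPU"
HIDDEN = 8        # assumption: tiny hidden size, for illustration only


def route(hidden, gate):
    """Top-1 routing: pick the expert whose gate row scores highest."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in gate]
    return max(range(len(scores)), key=scores.__getitem__)


def expert_ff(expert_id, hidden):
    """Toy stand-in for an expert's FF block: scale by (expert_id + 1)."""
    return [(expert_id + 1) * h for h in hidden]


def moe_layer(batch, gate):
    # 1. Routing: decide which expert each hidden state goes to.
    assignment = [route(h, gate) for h in batch]

    # 2. Scatter: group (position, hidden) pairs by expert, i.e. by "GPU".
    buckets = {e: [] for e in range(NUM_EXPERTS)}
    for pos, (h, e) in enumerate(zip(batch, assignment)):
        buckets[e].append((pos, h))

    # 3. Each "GPU" runs its own expert on the hidden states it received.
    # 4. Gather: results are written back in the original batch order.
    out = [None] * len(batch)
    for e, items in buckets.items():
        for pos, h in items:
            out[pos] = expert_ff(e, h)
    return out, assignment


rng = random.Random(0)
gate = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]
batch = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(128)]
out, assignment = moe_layer(batch, gate)
```

In a real multi-GPU setup the scatter and gather steps would be collective communication between devices rather than dictionary bookkeeping, but the per-layer shape of the operation is the same.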

You might be able to avoid the gather by performing the routing on each of the GPUs, but I'm generally guessing here.