
Interesting.

I had assumed the performance advantage for MoE came from minimising traffic between GPUs. But if it's per-layer routing, then it's going to massively increase inter-GPU traffic compared to vertical slicing.

I guess that means the performance advantage actually comes when batching thousands of queries? The MoE routing would mean that on each MoE layer, each GPU shard gets a batch of queries that will all hit roughly the same subset of experts (and read the same weights from memory). The batches then get reshuffled between MoE layers to re-optimise.
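A toy sketch of the grouping idea, assuming top-1 routing and a made-up router output (the `expert_ids` list, `group_by_expert` and `weight_loads` are hypothetical names for illustration, not taken from any real framework). It only counts how often expert weights would need to be fetched, and ignores the inter-GPU shuffle cost entirely:

```python
from collections import defaultdict


def group_by_expert(expert_ids):
    """Map each expert id to the indices of the tokens routed to it."""
    groups = defaultdict(list)
    for token_idx, expert in enumerate(expert_ids):
        groups[expert].append(token_idx)
    return dict(groups)


def weight_loads(expert_ids, batched):
    """Count expert-weight fetches for one MoE layer.

    Unbatched: every token fetches its expert's weights on its own.
    Batched: tokens sharing an expert are grouped, so each expert that
    is actually used has its weights fetched once.
    """
    if batched:
        return len(set(expert_ids))
    return len(expert_ids)


# Hypothetical router output: 8 tokens, 4 experts, top-1 routing.
expert_ids = [2, 0, 2, 1, 0, 2, 3, 0]

groups = group_by_expert(expert_ids)
unbatched = weight_loads(expert_ids, batched=False)
batched = weight_loads(expert_ids, batched=True)
```

With thousands of tokens in flight, the unbatched count grows with the number of tokens while the batched count is capped by the number of experts, which is roughly the amortisation I'm guessing at here.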

It's kind of like GPU raytracing where you get large performance gains by running coherency sorting on rays and batching similar rays together.