|
|
|
|
|
by rao-v
116 days ago
|
|
Yeah I’ve often wondered why folks aren’t training two tier MoEs for VRAM + RAM. We already have designs for shared experts so it cannot be hard to implement a router that allocated 10x or 100x as often to “core” experts vs the “nice to have” experts. I suppose balancing during training is tricky but some sort of custom loss on the router layers should work. I’ve also wondered why the routers aren’t training to be serially consistent so you can predict layers to swap into VRAM a few layers ahead to maximize available bandwidth. |
|
Unless you're handing that in some kind of fancy way, you'll be holding up the batch while waiting for host memory which will kill your throughout.
It makes much more sense for non batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.