by andrewgross
502 days ago
> The beauty of the MOE model approach is that you can decompose the big model into a collection of smaller models that each know different, non-overlapping (at least fully) pieces of knowledge.

I was under the impression that this is not how MoE models work. They are not a collection of independent models, but a way of routing to a subset of active parameters at each layer. There is no "expert" that is loaded or unloaded per question. All of the weights are loaded in VRAM; it's just a matter of which ones are actually read into registers for the calculation.

As far as I could tell from the DeepSeek V2/V3 papers, their MoE approach follows this rather than being an explicit collection of experts. If that's the case, there's no VRAM saving to be had from using an MoE, nor any ability to extract the weights of a single expert to run locally (aside from distillation or similar).

If someone is more versed in the construction of MoE architectures, I would love some help understanding what I missed here.
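To make "routing to a subset of active parameters at each layer" concrete, here is a minimal sketch of top-k routing for a single token at a single layer. It is an illustration under assumptions, not DeepSeek's actual implementation: the names `route_top_k` and `moe_layer`, the toy scalar experts, and the choice to renormalize the top-k weights are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of router logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: weights are renormalized over the selected experts,
    which is one common variant; real models differ in the details.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Evaluate only the k selected experts and mix their outputs."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy experts: each one just scales its input.
# Every expert stays in the list (i.e. in memory); only k are called.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
```

The routing decision is made again for every token at every MoE layer, so there is no single "expert" that answers a whole question.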
It doesn’t reduce memory usage, since each subsequent token might require a different expert, but it does reduce per-token compute and bandwidth usage. If you place experts on different GPUs and run batched inference, you would see these benefits.
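A rough way to see the memory-versus-compute distinction is the toy simulation below. Everything in it is an assumption for illustration: the expert count, `TOP_K`, the parameter count per expert, and especially the random router, which stands in for a learned one.

```python
import random

NUM_EXPERTS = 8                # hypothetical number of experts in one layer
TOP_K = 2                      # hypothetical experts activated per token
PARAMS_PER_EXPERT = 1_000_000  # hypothetical size of each expert

def pick_experts(rng):
    # Stand-in for a learned router: TOP_K distinct experts chosen at random
    return set(rng.sample(range(NUM_EXPERTS), TOP_K))

def simulate(num_tokens, seed=0):
    """Track which experts a stream of tokens touches in one layer."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(num_tokens):
        touched |= pick_experts(rng)
    resident = NUM_EXPERTS * PARAMS_PER_EXPERT    # must stay in memory
    active_per_token = TOP_K * PARAMS_PER_EXPERT  # read and computed per token
    return touched, resident, active_per_token

touched, resident, active = simulate(num_tokens=200)
print(f"experts touched: {len(touched)} of {NUM_EXPERTS}")
print(f"resident params: {resident:,}; active per token: {active:,}")
```

Each token reads only `TOP_K / NUM_EXPERTS` of the expert parameters, but over many tokens the set of touched experts grows toward all of them, which is why every expert has to stay resident.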