| HN Mirror



	by rao-v 116 days ago
	Yeah, I’ve often wondered why folks aren’t training two-tier MoEs for VRAM + RAM. We already have designs for shared experts, so it cannot be hard to implement a router that allocates 10x or 100x as often to “core” experts vs the “nice to have” experts. I suppose balancing during training is tricky, but some sort of custom loss on the router layers should work. I’ve also wondered why the routers aren’t trained to be serially consistent, so you can predict layers to swap into VRAM a few layers ahead to maximize available bandwidth.
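
A rough sketch of what such a custom router loss could look like, assuming a per-expert target share and a KL penalty on the batch-averaged routing probabilities. The helper names, the 10x ratio, and the choice of KL are illustrative assumptions, not taken from any particular model or framework; real load-balancing losses usually also account for hard top-k dispatch counts.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def tiered_target(n_core, n_tail, ratio):
    """Target share of routing mass per expert (hypothetical helper).

    Each core expert is asked to receive `ratio` times the share of a tail expert.
    """
    weights = np.concatenate([np.full(n_core, float(ratio)), np.ones(n_tail)])
    return weights / weights.sum()

def tiered_balance_loss(router_logits, target, eps=1e-12):
    """KL(target || mean routing probability) over a batch of tokens.

    router_logits: array of shape (tokens, experts).
    The loss approaches 0 when the average routing distribution matches the target.
    """
    mean_prob = softmax(router_logits).mean(axis=0)
    return float(np.sum(target * (np.log(target) - np.log(mean_prob + eps))))

# Example: 2 core experts meant for VRAM, 6 tail experts meant for RAM, 10x skew.
target = tiered_target(n_core=2, n_tail=6, ratio=10)
uniform_router = np.zeros((16, 8))                 # routes evenly, ignores the tiers
skewed_router = np.tile(np.log(target), (16, 1))   # already matches the target
print(tiered_balance_loss(uniform_router, target))  # positive penalty
print(tiered_balance_loss(skewed_router, target))   # approximately zero
```

In training this term would be added to the main loss with a small coefficient, the same way a uniform load-balancing loss is.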

3 comments

reitzensteinm 116 days ago

I think part of the issue is that in production deployments, you're batching high enough that you'll be paging in those long tail experts constantly.

Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting for host memory, which will kill your throughput.

It makes much more sense for non batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.

zozbot234 116 days ago

Ideally, you should rearrange batches so that inference steps that rely on the same experts get batched together, then inferences that would "hold up" a batch simply wait for that one "long tail" expert to be loaded, whereupon they can progress. This might require checkpointing partial inference steps more often, but that ought to be doable.
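
A minimal sketch of that regrouping, assuming the scheduler already knows which experts each request's next step needs. The `schedule_step` function and its data layout are hypothetical and do not correspond to any existing serving framework.

```python
from collections import defaultdict

def schedule_step(pending, resident):
    """Split requests into a runnable batch and per-expert wait lists.

    pending:  dict of request id -> set of expert ids its next step needs
    resident: set of expert ids currently held in fast memory
    """
    ready = []
    waiting = defaultdict(list)
    for req_id, needed in pending.items():
        missing = needed - resident
        if not missing:
            ready.append(req_id)            # can run right now
        else:
            for expert in sorted(missing):
                waiting[expert].append(req_id)
    # Load the most-requested missing expert first; ties broken by expert id.
    load_order = sorted(waiting, key=lambda e: (-len(waiting[e]), e))
    return ready, dict(waiting), load_order

pending = {"a": {0, 1}, "b": {0, 7}, "c": {1, 7}, "d": {2, 9}}
ready, waiting, load_order = schedule_step(pending, resident={0, 1, 2})
print(ready)       # requests that proceed immediately
print(waiting)     # requests parked per missing expert
print(load_order)  # order in which to page experts in
```

Requests parked in `waiting` would have their partial state checkpointed and rejoin a batch once their expert has been loaded.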

reitzensteinm 116 days ago

I think this is doable for very long tail experts that get swapped in for specialised topics - say, orbital mechanics.

But for experts that light up at, say, 1% frequency per batch, you're doing an awful lot of transfers from DRAM which you amortize over a single token, instead of reads from HBM which you amortize over 32 tokens.
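
To put rough numbers on that amortization argument, here is a back-of-the-envelope calculation. The expert size and both bandwidth figures are made-up assumptions chosen only to show the shape of the comparison.

```python
def load_seconds_per_token(expert_bytes, bandwidth_bytes_per_s, tokens_sharing_load):
    """Time spent moving one expert's weights, divided across the tokens that use it."""
    return expert_bytes / bandwidth_bytes_per_s / tokens_sharing_load

# All numbers below are illustrative assumptions, not measurements.
EXPERT_BYTES = 100e6   # hypothetical expert size
HOST_LINK_BW = 50e9    # hypothetical host-memory-to-GPU bandwidth, bytes/s
HBM_BW = 2000e9        # hypothetical HBM bandwidth, bytes/s

# Rare expert: paged in from DRAM and used by a single token in the batch.
rare = load_seconds_per_token(EXPERT_BYTES, HOST_LINK_BW, tokens_sharing_load=1)
# Hot expert: read from HBM and shared by 32 tokens in the batch.
hot = load_seconds_per_token(EXPERT_BYTES, HBM_BW, tokens_sharing_load=32)

print(f"rare expert: {rare * 1e3:.3f} ms/token")
print(f"hot expert:  {hot * 1e3:.5f} ms/token")
print(f"ratio: {rare / hot:.0f}x")
```

The gap is the bandwidth ratio multiplied by the batch-sharing ratio, so it grows with both.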

rao-v 116 days ago

I think your analysis is right: this would make sense mostly for the 30B-3A style models that are mostly for edge / hobbyist use, where context length is precious so nobody is batching.

Given that experts live per layer, I don't think it makes sense to have orbital mechanics experts, but … I have wondered about swapping out the bottom 10% of layers per topic, given that that is likely where the highest order concepts live. I’ve always wondered why people bother with LoRA on all layers, given that the early layers are more likely to be topic agnostic and focused on more basic pattern assembly (see the recent papers on how LLMs count on a manifold).

svnt 116 days ago

Maybe I am misunderstanding something but:

1) This is basically the intention of several recent MoE models: keep particular generally useful experts hot in VRAM.

2) Unless you can swap layers in faster than you consume them there is no point to predicting layers (what does this even really mean? did you mean predicting experts?).

It seems at the moment the best you can do is keep experts and layers more likely to be used for a given query in VRAM and offload the rest, but this is work-dependent.

rao-v 115 days ago

So llama.cpp currently statically puts overflow MoE experts in RAM and inferences them on CPU, so you get a mix of GPU + CPU inferencing. You are rooflined by RAM->CPU bandwidth + CPU compute.

With good predictability of MoE, you might see a world where it's more efficient to spend PCIe bandwidth (slower than RAM->CPU) on loading MoE experts for the next ~3 layers from RAM to VRAM so you are not rooflined by CPU compute.
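
A toy cost model for that trade-off, under loud assumptions: one offloaded expert per layer, a correctly predicted expert has its transfer fully overlapped with GPU compute, and a misprediction stalls for the whole transfer. The function names and every number are hypothetical.

```python
def cpu_offload_layer_time(expert_bytes, ram_bw, cpu_compute_s):
    """CPU path: bounded by the slower of streaming weights from RAM and CPU compute."""
    return max(expert_bytes / ram_bw, cpu_compute_s)

def prefetch_layer_time(expert_bytes, pcie_bw, gpu_compute_s, hit_rate):
    """GPU path with predicted prefetch over PCIe.

    Hit:  transfer overlaps with compute, so the layer costs the larger of the two.
    Miss: the expert is fetched on demand, so transfer and compute add up.
    """
    transfer = expert_bytes / pcie_bw
    hit = max(gpu_compute_s, transfer)
    miss = transfer + gpu_compute_s
    return hit_rate * hit + (1.0 - hit_rate) * miss

# Illustrative assumptions only: 100 MB expert, 50 GB/s RAM, 25 GB/s PCIe.
cpu = cpu_offload_layer_time(100e6, ram_bw=50e9, cpu_compute_s=0.003)
for hit_rate in (0.5, 0.9, 1.0):
    gpu = prefetch_layer_time(100e6, pcie_bw=25e9, gpu_compute_s=0.001, hit_rate=hit_rate)
    print(f"hit_rate={hit_rate}: prefetch {gpu * 1e3:.2f} ms vs CPU {cpu * 1e3:.2f} ms")
```

In this model a deeper lookahead only helps by raising the hit rate and hiding latency; if the per-layer transfer takes longer than the per-layer compute, the link stays the bottleneck no matter how far ahead the prediction runs.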

vLLM / SGLang (AFAIK) just assume you have enough VRAM to fit all the experts (but will page KV cache to RAM).

hedgehog 116 days ago

I don't have links handy but there is active research in this area.

rao-v 115 days ago

I'd love any keywords to search for to find active research on this topic!

hedgehog 115 days ago

Most of the work I'm aware of starts from the perspective of optimizing inference, but the idea of pushing the lessons upstream gets mentioned here and there.

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models (https://arxiv.org/abs/2505.16056)

Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression (https://arxiv.org/abs/2510.02345)
