

by 0x457 79 days ago

Okay, yes, you don’t need the entire MoE model in memory for it to function.

But you still need the working set of frequently used experts to actually fit in RAM, or at least stay in the page cache. Expert routing happens per token, per layer. If those weights aren't resident, you're effectively pulling them from disk on the critical path of generation, over and over again.
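To make the working-set point concrete, here is a toy simulation (a sketch, not a benchmark). It assumes uniform random top-k routing, which is a worst case since real routers tend to be skewed, and it models RAM as an LRU cache holding a fixed number of whole experts. The function name and every parameter are made up for illustration.

```python
import random
from collections import OrderedDict


def simulate_miss_rate(num_layers, experts_per_layer, top_k,
                       cache_slots, num_tokens, seed=0):
    """Fraction of expert lookups that miss an LRU cache of `cache_slots` experts.

    Assumption: routing is uniform random per token and per layer.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are (layer, expert), ordered oldest -> newest
    hits = misses = 0
    for _ in range(num_tokens):
        for layer in range(num_layers):
            # Pick top_k distinct experts for this token at this layer.
            for expert in rng.sample(range(experts_per_layer), top_k):
                key = (layer, expert)
                if key in cache:
                    cache.move_to_end(key)  # mark as most recently used
                    hits += 1
                else:
                    misses += 1  # stands in for a read from disk
                    cache[key] = None
                    if len(cache) > cache_slots:
                        cache.popitem(last=False)  # evict least recently used
    return misses / (hits + misses)


if __name__ == "__main__":
    # 4 layers x 8 experts = 32 experts total, 2 active per layer per token.
    for slots in (8, 16, 32):
        rate = simulate_miss_rate(4, 8, 2, slots, num_tokens=1000)
        print(f"cache_slots={slots:2d} miss_rate={rate:.3f}")
```

Once the cache holds every expert, only the cold misses remain. Shrink it below the working set and a large share of lookups turn into disk reads, each one sitting on the generation path.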

That’s not “just slower,” that’s an order of magnitude slower. You’ll end up with constant page faults and page cache churn. And if swap is on the same device as the model, you’re now competing for bandwidth on top of that.

IMO the main benefit of mmap is that the OS can reclaim cold pages during high memory-pressure events when the model isn't active.
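For reference, a minimal sketch of what that looks like from user space, assuming the weights sit in a single flat file. The helper names and the offsets are hypothetical. Because the mapping is read-only and file-backed, its pages are clean, so the kernel can drop them under pressure without writing anything to swap and fault them back in from the file later.

```python
import mmap
import os
import tempfile


def read_expert_slice(path, offset, length):
    """Read `length` bytes at `offset` through a read-only file mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are only loaded when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])


def demo():
    # Stand-in for a weights file: 16 KiB of a repeating 0..255 byte pattern.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 64)
        path = tmp.name
    try:
        return read_expert_slice(path, offset=4096, length=4)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Touching a slice only faults in the pages that back it, which is why an idle model costs close to nothing in RAM once its pages have been reclaimed.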