Hacker News
by zozbot234 81 days ago
With proper mmap support you don't really need the entire model in memory. It can be streamed from a fast SSD, and this is more useful for MoE models, where the expert-layers are not used uniformly. Of course the more data you stream from SSD, the slower this is; caching stuff in RAM is still relevant to good performance.
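To make the mmap point concrete, here is a minimal sketch, assuming a toy file where every expert is a fixed-size block. The names `EXPERT_BYTES`, `NUM_EXPERTS`, `write_fake_model`, and `read_expert` are made up for illustration and do not come from any real runtime. It only shows that slicing a mapping reads the requested byte range, not the whole file.

```python
import mmap
import os
import tempfile

EXPERT_BYTES = 4096  # hypothetical size of one expert's weights
NUM_EXPERTS = 8      # hypothetical number of experts in the toy file


def write_fake_model(path):
    # Fill each expert with its own index byte so reads are easy to verify.
    with open(path, "wb") as f:
        for expert_id in range(NUM_EXPERTS):
            f.write(bytes([expert_id]) * EXPERT_BYTES)


def read_expert(mm, expert_id):
    # Slicing the mapping copies only this byte range; the OS faults in
    # the pages that back it and leaves the rest of the file on disk.
    start = expert_id * EXPERT_BYTES
    return mm[start:start + EXPERT_BYTES]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.bin")
        write_fake_model(path)
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                return read_expert(mm, 5)
```

Calling `demo()` returns the bytes of expert 5 without ever reading the other seven blocks through Python. Whether those pages stay resident afterwards is up to the OS page cache.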
2 comments

Okay, yes, you don’t need the entire MoE model in memory for it to function.

But you still need the working set of frequently used experts to actually fit in RAM, or at least stay cached. Expert routing happens per token, per layer. If those weights aren’t resident, you’re effectively pulling them from disk on the critical path of generation — over and over again.

That’s not “just slower,” that’s an order of magnitude slower. You’ll end up with constant page faults and page cache churn. And if swap is on the same device as the model, you’re now competing for bandwidth on top of that.
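A rough way to see the working-set effect is a toy simulation. This is a sketch under loud assumptions: expert popularity follows a made-up Zipf-like skew, the cache is modeled as plain LRU (the real page cache is not strictly LRU), and the hit and miss latencies passed to `tokens_per_second` are hypothetical placeholders, not measurements.

```python
import random
from collections import OrderedDict


def simulate_hit_rate(num_experts, cache_slots, num_accesses, seed=0):
    rng = random.Random(seed)
    # Assumed Zipf-like popularity: expert 0 is the hottest.
    weights = [1.0 / (rank + 1) for rank in range(num_experts)]
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for _ in range(num_accesses):
        expert = rng.choices(range(num_experts), weights=weights)[0]
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)
        else:
            cache[expert] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used
    return hits / num_accesses


def tokens_per_second(hit_rate, lookups_per_token, hit_ms, miss_ms):
    # Expected per-token latency if every expert lookup sits on the
    # critical path and costs hit_ms when cached, miss_ms otherwise.
    per_token_ms = lookups_per_token * (
        hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms
    )
    return 1000.0 / per_token_ms
```

For example, compare `simulate_hit_rate(64, 64, 10000)` with `simulate_hit_rate(64, 8, 10000)` and feed both into `tokens_per_second` with a miss cost far above the hit cost. Once the hot experts stop fitting, the miss term dominates the per-token time.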

IMO the main benefit of mmap is the ability to reclaim cold pages during high memory-pressure events when the model isn't active.

You can do this on a Mac as well tho, right? So that 128 GB unified memory becomes a cache for a very fast 1+ TB Apple SSD.
I think the advantage of Flash-MoE compared to plain mmap is mostly the coalesced representation where a single expert-layer is represented by a single extent of sequential data. That could be introduced to existing binary formats like GGUF or HF - there is already a provision for differently structured representations, and that would easily fit.
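A sketch of what "one extent per expert" could look like, to make the idea concrete. This is not the Flash-MoE layout and not how GGUF or HF files are structured today; the tensor names (`gate`, `up`, `down`), the index format, and both function names are hypothetical. The point is only that all tensors of one expert sit back to back, so loading it is a single sequential read.

```python
import io


def pack_coalesced(experts):
    """Pack a list of {tensor_name: bytes} dicts so that each expert
    occupies one contiguous extent. Returns (blob, index)."""
    blob = bytearray()
    index = []  # per expert: (extent_start, extent_length, layout)
    for tensors in experts:
        start = len(blob)
        layout = {}
        for name, data in tensors.items():
            # Record each tensor's offset relative to the extent start.
            layout[name] = (len(blob) - start, len(data))
            blob += data
        index.append((start, len(blob) - start, layout))
    return bytes(blob), index


def load_expert(f, index, expert_id):
    start, length, layout = index[expert_id]
    f.seek(start)
    extent = f.read(length)  # one sequential read covers the whole expert
    return {
        name: extent[offset:offset + size]
        for name, (offset, size) in layout.items()
    }
```

With a layout where `gate`, `up`, and `down` for the same expert live in separate regions of the file, the same load would need three scattered reads instead of one.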