|
|
|
|
|
by zozbot234
81 days ago
|
|
With proper mmap support you don't really need the entire model in memory. It can be streamed from a fast SSD, and this is more useful for MoE models where not all expert-layers are uniformly used. Of course the more data you stream from SSD, the slower this is; caching stuff in RAM is still relevant to good performance. |
|
But you still need the working set of frequently used experts to actually fit in RAM, or at least stay cached. Expert routing happens per token, per layer. If those weights aren’t resident, you’re effectively pulling them from disk on the critical path of generation — over and over again.
That’s not “just slower,” that’s order of magnitude slower. You’ll end up with constant page faults and page cache churn. And if swap is on the same device as the model, you’re now competing for bandwidth on top of that.
IMO the main benefit of mmap is ability to reclaim cold pages during high memory-pressure events when model isn't active.