Hacker News
by abhikul0 63 days ago
A Mac has unified memory, so 36GB is 36GB for everything: GPU and CPU.
2 comments

CPU-MoE still helps with mmap. It should not hurt token-gen speed much on the Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.
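
A rough way to see the bandwidth argument: each generated token has to stream the active weights from memory about once, so bandwidth divided by active bytes gives a ceiling on tokens per second. This is only a back-of-the-envelope sketch. The 4B active parameters, 4.5 bits per weight, 400 GB/s bandwidth, and the 75% CPU share are all illustrative assumptions, not measurements from this thread.

```python
def max_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    # Billions of parameters * bits / 8 = gigabytes read per generated token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# All inputs below are assumed values for illustration only.
ACTIVE_PARAMS_B = 4        # "A4B": roughly 4B parameters active per token
BITS_PER_WEIGHT = 4.5      # a Q4-style quant, including scale overhead
FULL_BANDWIDTH_GB_S = 400  # hypothetical unified memory bandwidth
CPU_SHARE = 0.75           # hypothetical fraction the CPU can reach

gpu_bound = max_tokens_per_second(ACTIVE_PARAMS_B, BITS_PER_WEIGHT, FULL_BANDWIDTH_GB_S)
cpu_bound = max_tokens_per_second(
    ACTIVE_PARAMS_B, BITS_PER_WEIGHT, FULL_BANDWIDTH_GB_S * CPU_SHARE
)

print(f"ceiling with full bandwidth: {gpu_bound:.1f} tok/s")
print(f"ceiling with CPU share:      {cpu_bound:.1f} tok/s")
```

Real throughput will land below both ceilings, but the ratio between them is the point: if the CPU reaches most of the bandwidth, keeping the expert weights on the CPU side costs proportionally little.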
I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM. Not sure what's going on.
Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
For sure, I was running on autopilot with that reply. Though at Q4 I would expect it to fit, as a 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage.
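
For a sanity check on the "should fit" expectation, here is the arithmetic as a small sketch. The 4.5 bits per weight is an assumed average for a Q4-style quant; the 24B parameter count, the 18GB observation, and the 36GB total come from the comments above. Whatever is left after the weights would be KV cache, compute buffers, and similar overhead.

```python
def quantized_weights_gb(total_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in gigabytes."""
    return total_params_billion * bits_per_weight / 8


TOTAL_PARAMS_B = 24     # from the "24B-A4B" model name in the thread
BITS_PER_WEIGHT = 4.5   # assumption: Q4-style quant with scale overhead
OBSERVED_USAGE_GB = 18  # usage reported in the thread
UNIFIED_MEMORY_GB = 36  # machine size mentioned in the thread

weights_gb = quantized_weights_gb(TOTAL_PARAMS_B, BITS_PER_WEIGHT)
overhead_gb = OBSERVED_USAGE_GB - weights_gb
fits = OBSERVED_USAGE_GB < UNIFIED_MEMORY_GB

print(f"estimated weights: {weights_gb} GB")
print(f"implied overhead:  {overhead_gb} GB")
print(f"fits in {UNIFIED_MEMORY_GB} GB: {fits}")
```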