| HN Mirror



	by abhikul0 63 days ago
	Macs have unified memory, so 36GB is 36GB for everything: GPU and CPU.

2 comments

zozbot234 63 days ago

CPU-MoE still helps with mmap. Should not overly hurt token-gen speed on the Mac since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.


abhikul0 63 days ago

I'll try that, but llama-server has mmap on by default and it still takes up the full size of the model in RAM. Not sure what's going on.


zozbot234 63 days ago

Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
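A rough sketch of the two invocations discussed here, assuming a recent llama.cpp build where llama-server accepts `-ngl`, `--cpu-moe` and `--no-mmap`. The model path and the helper function are hypothetical, and the script only prints the command lines rather than running them:

```python
import shlex

MODEL = "model.gguf"  # hypothetical path, substitute your own GGUF file


def llama_server_cmd(model, *, gpu_layers=None, cpu_moe=False, mmap=True):
    """Build a llama-server argument list (illustrative helper, not part of llama.cpp)."""
    cmd = ["llama-server", "-m", model]
    if gpu_layers is not None:
        cmd += ["-ngl", str(gpu_layers)]  # number of layers offloaded to the GPU
    if cpu_moe:
        cmd.append("--cpu-moe")  # keep MoE expert weights on the CPU
    if not mmap:
        cmd.append("--no-mmap")  # load the file into memory instead of mapping it
    return cmd


# CPU-only run for troubleshooting: zero layers offloaded, mmap left at its default.
cpu_only = llama_server_cmd(MODEL, gpu_layers=0)
# Expert weights on the CPU, everything else left at the defaults.
moe_on_cpu = llama_server_cmd(MODEL, cpu_moe=True)

print(shlex.join(cpu_only))
print(shlex.join(moe_on_cpu))
```

Comparing memory usage between the CPU-only run and your normal run should show whether the GPU layers are what pins the full model in RAM.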


mhitza 63 days ago

For sure, I was running on autopilot with that reply. Though at Q4 I would expect it to fit, as a 24B-A4B Gemma model without CPU offloading got up to 18GB of VRAM usage.
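Back-of-the-envelope version of that expectation, as a sketch only. It assumes about 4.5 bits per weight (Q4_0 stores 32 four-bit weights plus one fp16 scale per block; other Q4 variants differ) and ignores KV cache and runtime buffers, which would account for the gap up to the 18GB observed:

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


TOTAL_RAM_GB = 36        # unified memory on the Mac in question
PARAMS_BILLION = 24      # total parameters, from the 24B-A4B figure above
BITS_PER_WEIGHT = 4.5    # assumption: Q4_0-style block quantization

weights = weight_gb(PARAMS_BILLION, BITS_PER_WEIGHT)
headroom = TOTAL_RAM_GB - weights

print(f"weights ~{weights:.1f} GB, headroom ~{headroom:.1f} GB of {TOTAL_RAM_GB} GB")
```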
