Hacker News
by
zozbot234
63 days ago
CPU-MoE still helps with mmap. It should not hurt token-generation speed much on a Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, and memory bandwidth is the bottleneck.
abhikul0
63 days ago
I'll try that, but llama-server has mmap on by default and the process still takes up the full size of the model in RAM. Not sure what's going on.
zozbot234
63 days ago
Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
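For anyone unsure what mmap changes here, a minimal Python sketch of a read-only file mapping follows. It is not llama.cpp code; `make_demo_file` and `map_and_read` are hypothetical helper names, and a small temporary file stands in for the model file. Mapping the file does not copy it into the process. Pages are faulted in only when they are read, and on Linux and macOS the touched file-backed pages are then typically counted in the process's resident memory, even though the OS can drop them under memory pressure. That may be part of why reported RAM use looks like the full model size, but that is an assumption, not something confirmed in this thread.

```python
import mmap
import os
import tempfile


def make_demo_file(size):
    """Create a temporary file of `size` bytes with a repeating pattern.

    Stand-in for a model file; the pattern makes reads easy to verify.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 251 for i in range(size)))
    return path


def map_and_read(path, offset, length):
    """Map `path` read-only and return (bytes read, total mapped size).

    Only the pages backing the requested slice need to be faulted in;
    the rest of the file is mapped but not necessarily loaded.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies the bytes out, so the result outlives the map.
            return mm[offset:offset + length], len(mm)


if __name__ == "__main__":
    path = make_demo_file(1 << 20)  # 1 MiB demo file
    try:
        chunk, mapped = map_and_read(path, 4096, 16)
        print(f"mapped {mapped} bytes, read {len(chunk)} bytes: {chunk.hex()}")
    finally:
        os.remove(path)
```

Comparing resident memory before and after reading different parts of a large mapped file is one way to see this lazy loading in practice.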