by sunpazed 969 days ago
Well, Metal can only allocate a portion of “VRAM” to the GPU (about 70% or so); see https://developer.apple.com/videos/play/tech-talks/10580

If you want to run larger models, then CPU inference is your only choice.
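
To make the arithmetic concrete, here is a rough sketch of that check. It assumes the ~70% figure above applies as a flat fraction of total RAM, which is only an approximation; the function names and the sizes in the usage note are made up for illustration.

```python
def gpu_budget_gb(total_ram_gb: float, gpu_fraction: float = 0.70) -> float:
    """Approximate memory the GPU can use, given total unified RAM.

    gpu_fraction defaults to the rough 70% figure quoted above; it is an
    assumption, not a value read from the system.
    """
    return total_ram_gb * gpu_fraction


def fits_on_gpu(model_gb: float, total_ram_gb: float, gpu_fraction: float = 0.70) -> bool:
    """True if a model of model_gb fits inside the estimated GPU budget.

    This ignores context/KV-cache and other overhead, so treat a narrow
    "fits" as optimistic.
    """
    return model_gb <= gpu_budget_gb(total_ram_gb, gpu_fraction)
```

For example, a hypothetical 32 GB machine would get a budget of roughly 22 GB, so `fits_on_gpu(20, 32)` is true while `fits_on_gpu(26, 32)` is false, and the 26 GB model would be left to the CPU.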


Aren't these things supposed to have cores dedicated to ML?
You’re thinking of the Neural Engine. I’m not sure that llama.cpp makes use of it. They’d have to convert the model to Core ML to do so.
Those cores are not as fast as the GPU (but draw much less power).

Also, not many implementations can even use it.