by bufo 967 days ago
It was pretty hard to saturate the memory bandwidth on the M2 on the CPU side (not sure about the GPU).

The GPU can saturate it for sure.

llama.cpp is a pretty extreme CPU RAM bus saturator, but I dunno how close it gets (and it's kind of irrelevant, because why wouldn't you use the Metal backend?).
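
For a rough idea of "how close", here is a back-of-the-envelope sketch. It assumes token generation is memory-bound and that every weight is read from RAM about once per generated token (dense model, batch size 1, ignoring the KV cache and CPU caches). The function names and all the numbers are hypothetical, for illustration only; plug in your own model size, measured tokens/s, and your chip's peak bandwidth.

```python
def effective_bandwidth_gb_s(model_size_gb: float, tokens_per_second: float) -> float:
    """Approximate memory traffic in GB/s.

    Assumption: each generated token reads the full set of weights once.
    """
    return model_size_gb * tokens_per_second


def bus_utilization(model_size_gb: float,
                    tokens_per_second: float,
                    peak_bandwidth_gb_s: float) -> float:
    """Fraction of the peak memory bandwidth the run appears to use."""
    achieved = effective_bandwidth_gb_s(model_size_gb, tokens_per_second)
    return achieved / peak_bandwidth_gb_s


# Hypothetical inputs: a 4 GB quantized model at 20 tokens/s
# on a machine with 100 GB/s of peak memory bandwidth.
print(f"{bus_utilization(4.0, 20.0, 100.0):.0%}")  # prints "80%"
```

If the ratio lands near 1.0, the run is probably bandwidth-limited; if it is well below, something else (compute, threading) is likely the bottleneck.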

Well, Metal can only allocate a portion of the unified memory to the GPU as “VRAM”, about 70% or so; see https://developer.apple.com/videos/play/tech-talks/10580

If you want to run larger models, then CPU inference is your only choice.
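
As a sketch of that rule of thumb: the check below takes the roughly 70% figure from the comment above as a default, which is an approximation and not an exact limit. On a real machine, Metal reports its own figure via the device's `recommendedMaxWorkingSetSize`. The helper names are hypothetical.

```python
GIB = 1024 ** 3


def gpu_budget_bytes(total_ram_bytes: int, gpu_fraction: float = 0.70) -> int:
    """Memory the GPU can use, assuming a fixed fraction of unified memory."""
    return int(total_ram_bytes * gpu_fraction)


def pick_backend(model_bytes: int, total_ram_bytes: int,
                 gpu_fraction: float = 0.70) -> str:
    """Pick where a model of the given size could plausibly run."""
    if model_bytes <= gpu_budget_bytes(total_ram_bytes, gpu_fraction):
        return "metal"
    # Optimistic: in practice the OS and other apps need some RAM too.
    if model_bytes <= total_ram_bytes:
        return "cpu"
    return "does not fit"


# Hypothetical 32 GiB machine: the GPU budget is about 22.4 GiB.
print(pick_backend(20 * GIB, 32 * GIB))  # prints "metal"
print(pick_backend(26 * GIB, 32 * GIB))  # prints "cpu"
```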

Aren't these things supposed to have cores dedicated to ml?
You’re thinking of the Neural Engine. I’m not sure that llama.cpp makes use of it. They’d have to convert the model to Core ML to do so.
The Neural Engine cores are not as fast as the GPU (but they draw much less power).

Also, not many implementations can even use it.