	by brucethemoose2 967 days ago
	The GPU can saturate it for sure. Llama.cpp is a pretty extreme CPU RAM bus saturator, but I dunno how close it gets (and it's kind of irrelevant, because why wouldn't you use a Metal backend?).

1 comments

Well, Metal can only allocate a portion of the “VRAM” to the GPU, about 70% or so; see https://developer.apple.com/videos/play/tech-talks/10580

If you want to run larger models, then CPU inference is your only choice.
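To make that concrete, here's a rough sketch of the arithmetic. It assumes the "about 70%" figure above holds as a flat fraction of unified memory, which is only an approximation; the function names and the example sizes are made up for illustration and aren't part of llama.cpp or Metal.

```python
# Rough sketch: does a model fit in the share of unified memory that
# Metal will hand to the GPU? The 0.70 default is the approximate
# figure quoted above (an assumption, not a value read from the device).

def metal_budget_gib(total_ram_gib: float, gpu_fraction: float = 0.70) -> float:
    """Approximate memory (GiB) the GPU can allocate via Metal."""
    return total_ram_gib * gpu_fraction


def pick_backend(model_gib: float, total_ram_gib: float,
                 gpu_fraction: float = 0.70) -> str:
    """Return a hypothetical backend choice for a model of the given size."""
    if model_gib <= metal_budget_gib(total_ram_gib, gpu_fraction):
        return "metal"          # fits within the GPU-allocatable share
    if model_gib <= total_ram_gib:
        return "cpu"            # too big for Metal, but fits in RAM
    return "does not fit"       # larger than total memory


if __name__ == "__main__":
    # Hypothetical 64 GiB machine with three model sizes.
    for size in (40, 50, 70):
        print(size, "GiB ->", pick_backend(size, 64))
```

This ignores the KV cache, the OS, and everything else competing for memory, so treat the real cutoffs as lower than what it prints.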

Aren't these things supposed to have cores dedicated to ML?

You’re thinking of the neural engine. I’m not sure that llama.cpp makes use of this. They’d have to turn it into a CoreML model to do so.

The neural engine cores are not as fast as the GPU (but use much less power).

Also, not many implementations can even use it.