| HN Mirror



	by sunpazed 969 days ago
	Well, Metal can only allocate a limited portion of memory as “VRAM” for the GPU, about 70% or so (see https://developer.apple.com/videos/play/tech-talks/10580). If you want to run larger models, then CPU inference is your only choice.
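To make the arithmetic concrete, here is a rough sketch of the fit check this implies. The 0.70 fraction is just the "about 70% or so" figure from the comment above, not a documented constant, and the function names are hypothetical helpers made up for illustration. It also ignores context/KV-cache overhead, so treat it as a back-of-the-envelope estimate only.

```python
# Assumption: roughly 70% of total memory is usable by the GPU via Metal,
# per the comment above. The real limit can differ per machine.
GPU_FRACTION = 0.70


def gpu_budget_gib(total_ram_gib: float, fraction: float = GPU_FRACTION) -> float:
    """Estimate how much memory (GiB) the GPU could be allowed to use."""
    return total_ram_gib * fraction


def fits_on_gpu(model_gib: float, total_ram_gib: float,
                fraction: float = GPU_FRACTION) -> bool:
    """Return True if the model weights fit inside the estimated GPU budget."""
    return model_gib <= gpu_budget_gib(total_ram_gib, fraction)


if __name__ == "__main__":
    ram = 32.0  # hypothetical machine with 32 GiB of unified memory
    for model in (13.0, 26.0):  # hypothetical model sizes in GiB
        where = "GPU (Metal)" if fits_on_gpu(model, ram) else "CPU only"
        print(f"{model:.0f} GiB model on {ram:.0f} GiB RAM: {where}")
```

With these numbers, a 32 GiB machine gets an estimated budget of about 22.4 GiB, so the 13 GiB model would fit on the GPU and the 26 GiB one would fall back to CPU inference.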

1 comments

Aren't these things supposed to have cores dedicated to ML?

You’re thinking of the Neural Engine. I’m not sure that llama.cpp makes use of it. They’d have to convert the model to a CoreML model to do so.

The Neural Engine cores are not as fast as the GPU (but draw much less power).

Also, not many implementations can even use it.