| HN Mirror



	by bufo 967 days ago
	It was pretty hard to saturate the memory bandwidth on the M2 on the CPU side (not sure about the GPU).


brucethemoose2 967 days ago

The GPU can saturate it for sure.

llama.cpp is a pretty extreme CPU RAM bus saturator, but I dunno how close it gets to the limit (and it's kind of irrelevant, because why wouldn't you use the Metal backend?).
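
For anyone who wants a rough feel for the CPU-side number, here is a minimal sketch that times a bulk copy using only the Python standard library. The function name `copy_bandwidth_gbs` and its parameters are made up for illustration. It is single-threaded, so treat the result as a loose lower bound and not as the chip's peak bandwidth.

```python
import time


def copy_bandwidth_gbs(size_bytes: int = 256 * 1024 * 1024, repeats: int = 5) -> float:
    """Best-of-N single-threaded copy throughput in GB/s (hypothetical helper).

    Counts both the bytes read from the source and the bytes written to the
    destination, so one copy of size_bytes moves 2 * size_bytes of traffic.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading from a coarse timer
    return 2 * size_bytes / best / 1e9
```

Calling `copy_bandwidth_gbs()` with a buffer much larger than the last-level cache gives a number to compare against the advertised bandwidth. A single thread usually lands well short of it, which matches the point above about the CPU side being hard to saturate.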

sunpazed 967 days ago

Well, Metal can only allocate a portion of the unified memory as “VRAM” for the GPU (about 70% or so); see https://developer.apple.com/videos/play/tech-talks/10580

If you want to run larger models, then CPU inference is your only choice.
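
As a back-of-the-envelope check, the sketch below compares a model's size against the GPU-allocatable share of memory. The 0.70 default is just the "about 70% or so" figure from above, not an exact Metal limit, and the function names are hypothetical. It also ignores KV cache and other runtime overhead, so a real run needs some headroom.

```python
GIB = 1024 ** 3


def gpu_budget_bytes(total_ram_bytes: int, gpu_fraction: float = 0.70) -> int:
    """Approximate bytes the GPU can allocate (assumed fraction, not an API value)."""
    return int(total_ram_bytes * gpu_fraction)


def fits_on_gpu(model_bytes: int, total_ram_bytes: int, gpu_fraction: float = 0.70) -> bool:
    """True if the model weights alone fit inside the assumed GPU budget."""
    return model_bytes <= gpu_budget_bytes(total_ram_bytes, gpu_fraction)
```

On a 32 GiB machine this gives a budget of roughly 22.4 GiB, so `fits_on_gpu(20 * GIB, 32 * GIB)` is true while `fits_on_gpu(24 * GIB, 32 * GIB)` is false, and the second model would be a candidate for CPU inference.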

__loam 966 days ago

Aren't these things supposed to have cores dedicated to ML?

azinman2 965 days ago

You’re thinking of the Neural Engine. I’m not sure that llama.cpp makes use of it. The model would have to be converted to a Core ML model to do so.

brucethemoose2 965 days ago

The Neural Engine cores are not as fast as the GPU (but draw much less power).

Also, not many implementations can even use them.
