| HN Mirror


	by summarity 1200 days ago
	No catch, it just works. 30B runs fine on an M1 Max with 64GB of RAM; I had to go to the M1 Ultra with 128GB for 65B.

4 comments

cjbprime 1200 days ago

I was wondering if Apple Silicon would be uniquely suited for high-GPU-RAM tasks because it shares memory across the system. But I guess in this case it's a CPU model, so that's unrelated. Is that right? Do you think you could run these models on GPU instead?

bojangleslover 1200 days ago

I'm not able to run 13B, and his wiki says:

> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models.

simonw 1200 days ago

This commit landed 7 hours ago (since I wrote my TIL): https://github.com/ggerganov/llama.cpp/commit/007a8f6f459c6e...

minxomat 1200 days ago

This was fixed almost 2 days ago. It's literally mentioned at the top of the repo.

detrites 1200 days ago

What's the tokens/s on those?

minxomat 1200 days ago

With 16 threads, about 140 ms per token for 30B and 300 ms per token for 65B.

I should also mention that 65B should be able to run on 64GB systems. Total system memory consumption on M1 Ultra is about 67GB when running nothing else.
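
Since the question was about tokens/s and the figures above are per-token latencies, here is a minimal sketch of the conversion. The helper name `tokens_per_second` is made up for illustration; the only inputs are the two latencies quoted above, and this assumes they are steady-state averages.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Latencies reported in this thread (16 threads); nothing else is measured here.
reported_ms_per_token = {"30B": 140.0, "65B": 300.0}

for model, ms in reported_ms_per_token.items():
    print(f"{model}: {ms:.0f} ms/token is about {tokens_per_second(ms):.1f} tokens/s")
```

That works out to roughly 7 tokens/s for 30B and a bit over 3 tokens/s for 65B.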

xiphias2 1200 days ago

You have both at home / work?

minxomat 1200 days ago

A laptop and a desktop (Mac Studio)
