by summarity 1200 days ago
No catch, just works. 30B works fine on an M1 Max with 64GB of RAM, had to go for the M1 Ultra at 128GB for 65B.
4 comments

I was wondering if Apple Silicon would be uniquely suited for high-GPU-RAM tasks because it shares memory across the system. But I guess in this case it's a CPU model, so that's unrelated. Is that right? Do you think you could run these models on GPU instead?
I'm not able to run 13B, and from his wiki:

> Currently, only LLaMA-7B is supported since I haven't figured out how to merge the tensors of the bigger models.

This commit landed 7 hours ago (since I wrote my TIL): https://github.com/ggerganov/llama.cpp/commit/007a8f6f459c6e...
This was fixed almost 2 days ago now. It’s literally mentioned at the top of the repo.
What's the tokens/s on those?
With 16 threads, about 140ms per token for 30B, 300ms per token for 65B
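
Since the question was about tokens/s, here is a quick conversion of those per-token latencies. It's only a sketch: it assumes throughput is simply the reciprocal of the average latency, and the helper name `tokens_per_second` is made up for illustration.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert average per-token latency (milliseconds) to tokens per second."""
    return 1000.0 / ms_per_token

# Latencies reported above with 16 threads (approximate figures).
latencies_ms = {"30B": 140, "65B": 300}

# Throughput for each model size, assuming steady-state generation.
rates = {model: tokens_per_second(ms) for model, ms in latencies_ms.items()}

for model, rate in rates.items():
    print(f"{model}: {rate:.1f} tokens/s")
```

So roughly 7 tokens/s for 30B and 3 tokens/s for 65B.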

I should also mention that 65B should be able to run on 64GB systems. Total system memory consumption on M1 Ultra is about 67GB when running nothing else.

You have both at home / work?
A laptop and a desktop (Mac Studio)