I was wondering if Apple Silicon would be uniquely suited for high-GPU-RAM tasks because it shares memory between the CPU and GPU. But I guess in this case the model runs on the CPU, so that's unrelated. Is that right? Do you think you could run these models on the GPU instead?
With 16 threads, about 140 ms per token for 30B and 300 ms per token for 65B.
I should also mention that 65B should be able to run on 64GB systems. Total system memory consumption on M1 Ultra is about 67GB when running nothing else.
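For anyone who thinks in tokens per second rather than milliseconds per token, here is a quick sketch of the conversion using the two latencies quoted above. The function name and the dictionary are just illustrative, and the inputs are the rough figures from this thread, not new measurements.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_token


# Approximate latencies quoted in the thread (16 threads).
reported_ms = {"30B": 140.0, "65B": 300.0}

throughput = {name: tokens_per_second(ms) for name, ms in reported_ms.items()}

for name, tps in throughput.items():
    print(f"{name}: {tps:.1f} tokens/s")
# 30B: 7.1 tokens/s
# 65B: 3.3 tokens/s
```

So roughly 7 tokens/s for 30B and a bit over 3 tokens/s for 65B, taking the quoted numbers at face value.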