- You don’t get GPU acceleration just by using unified memory. llama.cpp runs on the CPU on Apple Silicon unless it is built with its Metal backend.
- The difference in tokens/sec is likely attributable to memory bandwidth. Mac Studios with the base Max chip have 400 GB/s of memory bandwidth, compared to around 50 GB/s for Ryzen 5000 series CPUs.
Edit: It seems some people are getting 1-2.6 tokens/sec on Ryzen (no GPU acceleration) with a quantized Llama 70B: https://www.reddit.com/r/LocalLLaMA/comments/15rqkuw/llama_2...
A Mac Studio, by comparison, gets 13 tokens/sec: https://blog.gopenai.com/how-to-deploy-llama-2-as-api-on-mac...
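
To put rough numbers on the bandwidth argument: if generation is memory-bound, every token requires streaming roughly the whole set of weights from RAM once, so tokens/sec is on the order of bandwidth divided by model size. Here is a back-of-the-envelope sketch. The 4-bit quantization (about 0.5 bytes per parameter) is my assumption, since the linked posts only say "quantized", and the function name is made up for illustration. It ignores compute, KV cache traffic, and quantization overhead, so treat the output as an order-of-magnitude guide only.

```python
def est_tokens_per_sec(bandwidth_gb_s, params_billions, bits_per_weight):
    """Rough memory-bound estimate: one full pass over the weights per token."""
    # Model size in GB: billions of parameters * bytes per parameter.
    model_gb = params_billions * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# Bandwidth figures are the ones quoted above; 70B at 4 bits is an assumption.
systems = [
    ("Mac Studio (base Max chip)", 400.0),
    ("Ryzen 5000 series", 50.0),
]

for name, bandwidth in systems:
    rate = est_tokens_per_sec(bandwidth, params_billions=70, bits_per_weight=4)
    print(f"{name}: ~{rate:.1f} tokens/sec")
```

That comes out to roughly 11 tokens/sec for the Mac Studio and about 1.4 for the Ryzen, which is in the same ballpark as the numbers reported in the links.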