I have 128 GB of RAM in my computer with a 5850x, which lets me load and run the Falcon 180B and Llama 2 70B LLMs in llama.cpp, although with different quantization.
- You don’t get GPU acceleration just by using unified memory. llama.cpp still only uses the CPU on Apple Silicon chips.
- The difference in tokens/sec is likely attributable to memory bandwidth. Mac Studios with the base Max chip have 400 GB/s of memory bandwidth, compared to around 50 GB/s for the Ryzen 5000 series CPUs.
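
To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes that generating each token means streaming the full set of weights from memory once, so bandwidth divided by model size gives an upper bound on tokens/sec. The 40 GB model size is my own assumption for a roughly 4-bit 70B model, and `max_tokens_per_sec` is just an illustrative helper, not anything from llama.cpp.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Assumed size of a ~4-bit quantized 70B model (hypothetical round number).
MODEL_SIZE_GB = 40.0

for name, bandwidth in [("Mac Studio (Max)", 400.0), ("Ryzen 5000", 50.0)]:
    bound = max_tokens_per_sec(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{bound:.2f} tokens/sec")
```

Real throughput will be lower than these bounds, but the ratio between the two machines should follow the bandwidth ratio if memory is the bottleneck.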
How fast is your setup?