The 512GB model would have to use a lobotomized quant like Q2 or Q1, and you would still be waiting 3-5 minutes just to process prompts in the 32,000-64,000 token range.
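For a rough sense of what those numbers mean, here is a quick back-of-the-envelope sketch that turns the 3-5 minute wait into an implied prompt-processing rate. The helper name `implied_prompt_speed` is made up for illustration, and the inputs are only the figures quoted above, not benchmark results.

```python
def implied_prompt_speed(tokens: int, seconds: float) -> float:
    """Tokens per second implied by processing `tokens` in `seconds`."""
    return tokens / seconds


# Figures taken from the comment above: 32k-64k tokens, 3-5 minutes.
for tokens in (32_000, 64_000):
    for minutes in (3, 5):
        rate = implied_prompt_speed(tokens, minutes * 60)
        print(f"{tokens:>6} tokens in {minutes} min -> {rate:6.1f} tok/s")
```

Depending on which ends of the two ranges you pair up, that works out to somewhere between roughly 107 and 356 tokens per second of prompt processing.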
Apple's GPUs are just not very fast for inference. I'd stick to the smaller 7B-18B parameter range, or MoE models like Qwen's, if you want usable inference speed.