Neither the Mac Studio nor the DGX Spark, for that matter, comes close to running SOTA-level models. Time is money, and waiting on these half-baked solutions is a waste of both.
The Mac Studio in particular has a GPU that is far too weak for enterprise-scale context prefill. You'd need 2 or 4 Studios to process 250k contexts quickly, and even then you'd be bottlenecked by the relatively slow memory bandwidth during the decode stage. It is simply terrible hardware for quick or power-efficient inference.
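To put rough numbers on the prefill vs. decode point, here's a back-of-envelope sketch. It assumes the usual simplifications: prefill is compute-bound at about 2 FLOPs per active parameter per token (attention cost ignored), and decode is bandwidth-bound because every active weight gets read once per generated token. The function names and the example values are hypothetical placeholders, not measured specs of any device, so plug in your own.

```python
def prefill_seconds(active_params, prompt_tokens, flops_per_sec):
    """Compute-bound estimate of prompt processing time.

    Assumes ~2 FLOPs per active parameter per token and ignores
    the attention term, which grows with context length.
    """
    return 2.0 * active_params * prompt_tokens / flops_per_sec


def decode_tokens_per_sec(active_params, bytes_per_param, bandwidth_bytes_per_sec):
    """Bandwidth-bound estimate of generation speed.

    Assumes every active weight is read from memory once per token
    and ignores KV-cache traffic.
    """
    return bandwidth_bytes_per_sec / (active_params * bytes_per_param)


if __name__ == "__main__":
    # Placeholder inputs (assumptions, not real hardware numbers):
    # 10B active params, 250k-token prompt, 10 TFLOP/s sustained,
    # 1 byte per weight, 500 GB/s memory bandwidth.
    t = prefill_seconds(1e10, 250_000, 1e13)
    r = decode_tokens_per_sec(1e10, 1.0, 5e11)
    print(f"prefill: {t:.0f} s, decode: {r:.0f} tok/s")
```

Under those placeholder inputs the prompt alone takes minutes before the first token shows up, and adding boxes only divides the prefill time under ideal scaling. It does nothing for per-stream decode speed unless the weights are split across them.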