by gizajob 15 days ago
I really don’t get it. Why not put a Mac Studio with 128GB of RAM on every engineer's desk and say, “Engineer, engineer your local LLM”? It makes no sense to spend $20,000-30,000+ per year on cloud providers when Qwen et al. are available. It makes even less sense to send all your company code and data to Anthropic and OpenAI when you can keep all that IP in the building.
4 comments

The Mac is very feeble compared to the big iron that the providers run, so performance will be much lower. Also, many companies would prefer that engineers work on domain problems instead of on novel LLMs.
I meant “roll your own” LLM for use, not building new ones.
The Mac Studio (and DGX Spark, for that matter) fall short of running SOTA-level models by a large margin. Time is money, and waiting on these half-baked solutions wastes both.

Especially concerning the Mac Studio, the GPU is far too weak for enterprise-scale context prefill. You'd need 2 or 4 Studios to process 250k-token contexts quickly, and even then you'd get bottlenecked by the relatively slow memory bandwidth during the decode stage. It is simply terrible hardware for quick or power-efficient inference.
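
For anyone who wants to sanity-check the prefill/decode point, here is a rough back-of-envelope sketch. It assumes a dense model, the common approximation of about 2 FLOPs per parameter per prompt token for prefill (attention cost ignored), and a decode ceiling set by re-reading all weights once per generated token. The function names and every number in the example calls are placeholders for illustration, not measured figures for any real machine.

```python
def prefill_seconds(params_billions, prompt_tokens, tflops):
    """Compute-bound estimate: ~2 FLOPs per parameter per prompt token.

    Ignores attention cost, so long contexts will be slower than this.
    """
    flops_needed = 2 * params_billions * 1e9 * prompt_tokens
    return flops_needed / (tflops * 1e12)


def decode_tokens_per_second(weights_gb, bandwidth_gb_s):
    """Bandwidth-bound ceiling: each new token re-reads all the weights once."""
    return bandwidth_gb_s / weights_gb


# Placeholder inputs (assumptions, not benchmarks):
# a 30B-parameter dense model, a 250k-token prompt, 25 sustained TFLOPS,
# 15 GB of quantized weights, 500 GB/s of memory bandwidth.
print(prefill_seconds(30, 250_000, 25))
print(decode_tokens_per_second(15, 500))
```

Plug in your own hardware numbers. The point is only that prefill time scales with compute and decode speed scales with memory bandwidth, so a box can be fine at one and poor at the other.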

Because local models that run well in 128GB of RAM are still not SOTA. Yes, Qwen is amazing, but neither Qwen 27B nor 35B can outperform Opus 4.6. So why create even more rework for your engineers when you can pay slightly more and always use SOTA, at least until others figure out best practices for running SOTA models locally?
Because it’s cheaper to pay for the tokens than to pay their engineers to worry about a worse, homebrewed setup.
SOTA models cannot remotely fit in 128GB.
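
The fit question comes down to simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A minimal sketch follows. The `overhead_gb` allowance for KV cache, OS, and runtime is an assumed placeholder, and the 1000B model is a hypothetical stand-in, since parameter counts for frontier models are not public.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB: params * bits / 8."""
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, ram_gb=128, overhead_gb=16):
    """True if weights plus an assumed overhead allowance fit in RAM."""
    return weights_gb(params_billions, bits_per_weight) + overhead_gb <= ram_gb


# A 27B model at 4-bit quantization: 13.5 GB of weights, fits easily.
print(weights_gb(27, 4), fits(27, 4))

# A hypothetical 1000B-parameter model at 4-bit: 500 GB of weights, does not fit.
print(weights_gb(1000, 4), fits(1000, 4))
```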