by duchenne 36 days ago
Cloud models can use batch processing, which is significantly more efficient. A local model effectively runs a batch of one, which takes about as long to process as a batch of 100, because the GPU is memory-bound and spends most of its time loading the model weights from VRAM into the GPU cache while the cores sit idle. With a batch of 100, the weight-loading time and the compute time are roughly similar. So local models start out about 100x less efficient. Secondly, local models are idle most of the time, waiting for the user to write a prompt, so the efficiency gap is probably closer to 1000x.
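For anyone who wants to check the arithmetic, here is a rough roofline-style sketch of that argument. Every number in it is an assumption chosen for illustration (a 70B-parameter model in 16-bit weights, 2 TB/s of memory bandwidth, 200 TFLOPS of compute, a local machine that is busy 10% of the time), picked so the crossover lands near a batch of 100 as in the comment. The function names are made up, and the model ignores KV-cache traffic and other overheads.

```python
# Back-of-the-envelope model of one decode step (illustrative assumptions only).

WEIGHT_BYTES = 140e9       # assumed: 70B parameters * 2 bytes each
MEM_BANDWIDTH = 2e12       # assumed: 2 TB/s from VRAM to the compute units
FLOPS_PER_TOKEN = 140e9    # assumed: ~2 FLOPs per parameter per token
PEAK_FLOPS = 2e14          # assumed: 200 TFLOPS of usable compute


def step_time_s(batch: int) -> float:
    """One decode step takes the longer of weight loading and compute."""
    load_time = WEIGHT_BYTES / MEM_BANDWIDTH              # paid once per step
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS   # grows with batch
    return max(load_time, compute_time)


def tokens_per_second(batch: int) -> float:
    """Each step emits one token per sequence in the batch."""
    return batch / step_time_s(batch)


# Batch of 1 and batch of 100 take about the same wall-clock time per step,
# so the batched server gets roughly 100x the tokens out of the same hardware.
batch_gain = tokens_per_second(100) / tokens_per_second(1)

# Assumed: the local machine is actually generating only 10% of the time.
LOCAL_UTILIZATION = 0.10
total_gap = batch_gain / LOCAL_UTILIZATION

print(f"step time, batch 1:   {step_time_s(1):.3f} s")
print(f"step time, batch 100: {step_time_s(100):.3f} s")
print(f"batching gain: {batch_gain:.0f}x, with idle time: {total_gap:.0f}x")
```

Under these assumptions the 1000x figure corresponds to a local machine that is busy about a tenth of the time; a different utilization scales the gap proportionally.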
2 comments

It's an interesting point but local gpu efficiency is not something I think about when I'm being rate limited or when my subscription costs keep rising.
I think folks in this thread are underestimating how expensive it is to serve a SoTA model at 100 tokens a second. In addition to the $500k in capital costs, you also have significant electricity costs.

This stuff is expensive because supply is much lower than demand. If everyone were to run their own hardware with a batch size of 1, we'd have 100x more demand for inference hardware and electricity than we do now, and people would be even more frustrated. Efficiency is everything, and we need all the economies of scale we can get to meet demand.

But that's why you shouldn't expect local models to provide quick real-time answers, at least not with the same smarts as SOTA models running in the cloud. Slow batched inference (if possible - RAM capacity can obviously be a challenge with typical models and end-user hardware) can be a lot more effective.
My point is that it is WAY more efficient if we put the world's DRAM supply into a shared inference pool instead of stranding it in local machines where it won't have as high a batch size or utilization.

The cost of not being efficient is even higher DRAM costs than we have now, given supply and demand.

Much of the world's DRAM stock is sitting idle in consumers' local machines and on-prem servers. If that DRAM gets some use, even "inefficiently", that's a meaningful decrease in demand.
That DRAM would get even more use if it were removed from these machines and placed into a shared pool :) I joke, but thanks to the brutal DRAM market there has been some movement in this direction lately...
And what if your local computer essentially has a model chip with dedicated memory where the model stays loaded 100% of the time?