by r0b05 42 days ago
It's an interesting point, but local GPU efficiency is not something I think about when I'm being rate limited or when my subscription costs keep rising.

I think folks in this thread are underestimating how expensive it is to serve a SoTA model at 100 tokens a second. In addition to the $500k in capital costs, you also have significant electricity costs.

This stuff is expensive because supply is much lower than demand. If everyone were to run their own hardware with a batch size of 1, we'd have 100x more demand for inference hardware and electricity than we do now, and people would be even more frustrated. Efficiency is everything, and we need all the economies of scale we can get to meet demand.
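
To put rough numbers on the batch-size argument, here is a back-of-the-envelope sketch. Only the $500k capital cost and the 100 tokens per second come from the comment above. The 3-year amortization, 10 kW power draw, $0.10/kWh electricity price, and the batch size of 100 are assumptions picked for illustration, and the model idealizes batching by assuming per-stream speed does not drop as the batch grows.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def cost_per_million_tokens(
    capital_usd,
    tokens_per_second_per_stream,
    batch_size,
    utilization=1.0,         # assumed: fraction of time spent generating
    amortization_years=3.0,  # assumed hardware lifetime
    power_kw=10.0,           # assumed power draw of the machine
    usd_per_kwh=0.10,        # assumed electricity price
):
    """Amortized hardware plus electricity cost per one million generated tokens."""
    lifetime_seconds = amortization_years * SECONDS_PER_YEAR
    # Tokens produced over the lifetime, summed across all concurrent streams.
    # Idealization: per-stream speed is unaffected by batch size.
    total_tokens = (
        tokens_per_second_per_stream * batch_size * utilization * lifetime_seconds
    )
    # Simplification: the machine draws full power whether or not it is busy.
    electricity_usd = power_kw * (lifetime_seconds / 3600) * usd_per_kwh
    return (capital_usd + electricity_usd) / total_tokens * 1_000_000


# One user per machine versus the same machine shared by 100 concurrent streams.
solo = cost_per_million_tokens(500_000, 100, batch_size=1)
shared = cost_per_million_tokens(500_000, 100, batch_size=100)

print(f"batch size 1:   ${solo:,.2f} per million tokens")
print(f"batch size 100: ${shared:,.2f} per million tokens")
print(f"ratio: {solo / shared:.0f}x")
```

Under these assumptions the cost per token scales inversely with batch size, which is where a figure like 100x comes from. Real batching gains are smaller, since per-stream speed falls as the batch grows, and lowering `utilization` for a mostly idle local machine widens the gap further.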

But that's why you shouldn't expect local models to provide quick real-time answers, at least not with the same smarts as SOTA models running in the cloud. Slow batched inference (if possible - RAM capacity can obviously be a challenge with typical models and end-user hardware) can be a lot more effective.
My point is that it is WAY more efficient if we put the world's DRAM supply into a shared inference pool instead of stranding it in local machines where it won't have as high a batch size or utilization.

The cost of not being efficient is even higher DRAM costs than we have now, given supply and demand.

Much of the world's DRAM stock is sitting idle in consumers' local machines and on-prem servers. If that DRAM gets some use, even "inefficiently", that's a meaningful decrease in demand.
That DRAM would get even more use if it were removed from these machines and placed into a shared pool :) I joke, but thanks to the brutal DRAM market there has been some movement in this direction lately...
I think the question of who controls the model is far more pressing than the question of who owns the DRAM.

It's easy to rattle off a half-dozen different vectors of likely enshittification over the next few years -- ranging from increasing censorship, to lower rate limits, to removal of existing features and forced addition of unwelcome new ones, to extortionate price increases, to unexplained and irreversible account bans. The only way to avoid them all is by running weights you own on hardware you control.

How smart and how fast is your local model? Those are certainly important questions, but "Does it exist at all?" is more important.