Hacker News
by zozbot234 36 days ago
But that's why you shouldn't expect local models to provide quick real-time answers, at least not with the same smarts as SOTA models running in the cloud. Slow batched inference (if possible - RAM capacity can obviously be a challenge with typical models and end-user hardware) can be a lot more effective.

My point is that it is WAY more efficient if we put the world's DRAM supply into a shared inference pool instead of stranding it in local machines, where it won't reach as high a batch size or utilization.

The cost of not being efficient is even higher DRAM costs than we have now, given supply and demand.

Much of the world's DRAM stock is sitting idle in consumers' local machines and on-prem servers. If that DRAM gets some use, even "inefficiently", that's a meaningful decrease in demand.

That DRAM would get even more use if it was removed from these machines and placed into a shared pool :) I joke, but thanks to the brutal DRAM market there has been some movement in this direction lately...

I think the question of who controls the model is far more pressing than the question of who owns the DRAM.

It's easy to rattle off a half-dozen different vectors of likely enshittification over the next few years -- ranging from increasing censorship, to lower rate limits, to removal of existing features and forced addition of unwelcome new ones, to extortionate price increases, to unexplained and irreversible account bans. The only way to avoid them all is by running weights you own on hardware you control.

How smart and how fast is your local model? Those are certainly important questions, but "Does it exist at all?" is more important.

There isn't enough hardware in the world for everyone to run their own SoTA model. The only hope we have is if we work together to host these on shared infrastructure, benefiting from >50x economies of scale due to batching, etc. That infrastructure doesn't have to be owned by greedy corporations.
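
To put rough numbers on the batching argument, here is a back-of-the-envelope sketch. It assumes a simple roofline-style model of decoding: every step has to stream all the weights from memory once no matter how many sequences are in the batch, so throughput grows with batch size until compute becomes the limit. The model size, bandwidth, and FLOPS figures are made-up illustrative values for a hypothetical accelerator, not measurements, and KV-cache traffic is ignored entirely.

```python
# All constants are illustrative assumptions, not measured hardware figures.
WEIGHT_BYTES = 140e9      # assumed: 70B parameters stored at 2 bytes each
MEM_BANDWIDTH = 2e12      # assumed: 2 TB/s memory bandwidth
FLOPS_PER_TOKEN = 140e9   # assumed: ~2 FLOPs per parameter per generated token
PEAK_FLOPS = 400e12       # assumed: 400 TFLOP/s peak compute


def decode_tokens_per_second(batch_size: int) -> float:
    """Rough aggregate decode throughput for a given batch size.

    Ignores KV-cache reads, attention cost, and scheduling overhead.
    """
    # The weights are read once per step and shared by the whole batch.
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH
    # The arithmetic scales with the number of sequences in the batch.
    compute_time = batch_size * FLOPS_PER_TOKEN / PEAK_FLOPS
    # The step takes as long as the slower of the two.
    step_time = max(memory_time, compute_time)
    return batch_size / step_time


if __name__ == "__main__":
    baseline = decode_tokens_per_second(1)
    for batch in (1, 8, 64, 200, 400):
        rate = decode_tokens_per_second(batch)
        print(f"batch={batch:4d}  tokens/s={rate:9.1f}  vs batch=1: {rate / baseline:6.1f}x")
```

Under these assumptions the aggregate throughput scales linearly with batch size while the step is memory-bound, then flattens once compute time overtakes the weight-streaming time. Where that ceiling sits depends entirely on the real hardware and model, so treat the output as a shape, not a prediction.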