Hacker News new | ask | show | jobs
by tfirst 5 days ago
If model performance continues to scale with model size, I have a hard time seeing how local models will have any chance of competing with models hosted on datacenter hardware.

1. There are strong economies of scale in hosting inference (batched prompts, high uptime, shared infrastructure).

2. There are physical limits on how much memory we will be able to produce over the next few years. Demand will probably scale at least as fast as production does, so we won't be saved by falling prices.

5 comments

For your first point- You've just repeated "shared tenant." A scaling factor that's been used since before the turn of the milenium. Uptime is, as always, an irrelevancy for personal/homelab vs cloud. It shifts from uptime to pure financial (capex first, then how you account for "wasted" time).

2) The current memory crunch is more political than cyclical. The only reason we have fabs as far intro construction as we do is CHIPS Act. Which, predates LLMs public existance by more than 6mo. the horrific silicon prices are a direct result of openAI's openly Illegal dealings. Their pretense of needing it for stargate gets sundered further with each missed or cancelled deadline.

They predicted the political and regulatory outcome superbly.

That's a big if.

The big commercial models seem to gain far more from pre-processing than they do from size, and you can already run pretty useful models on desktop hardware.+

Check out this video about how DeepMind significantly improved performance: https://youtu.be/Dkqzqw8rxXI They basically ran the LLM tuning through an old-school genetic or annealing style algorithm and trounced what a larger model could do alone.

There's a value for many people and organizations with running a model locally on hardware they fully own and control (or pay to colocate in a datacenter somewhere) vs running a model on something owned/controlled by any third party. For highly privacy sensitive, medical applications, etc. It's not just a question of raw efficiency in dollar per tokens per day or tokens/second.
Cloud models will always be ahead, but not every task needs Fable-level intelligence. The number of usable situations for local models will increase as hardware and open-weight models improve.
GLM hasn't ballooned model size yet is ridiculously scaled?