by stingraycharles 23 days ago
Also, your local hardware is in no way capable of running the types of models that the cloud providers do, it’s just not economically feasible, and it never will be.

Very much dependent on the situation. For many business tasks, local hardware is good enough. But what a lot of folks overlook when saying these things is that (a) workers do more than run AI models on a piece of hardware, (b) significant computer hardware is already sitting idle outside normal work hours, when it can be running batch jobs, and (c) employees can share local hardware.
SanDisk has designed a flash equivalent to HBM, which has 1.6 TB/s of bandwidth. I expect that it will initially be available to server manufacturers only, but once supply ramps up it will be built into individual machines. At that point it will be practical to run local inference on much larger models. Of course, maybe the SOTA providers will find some way to use even larger ones, but it seems like the returns to scale aren't what they used to be.
> it never will be.

Giving strong “640k is enough for anyone” vibes here.

The 640k statement was absolute; this one is comparative.

Cloud should have more compute and efficiency than local. I wouldn't be 100% sure, as I don't know what I might not be seeing, but still.

Whether that comparative advantage will matter, though, is a completely different question.

Gotcha, I think I misunderstood the statement as saying today’s cloud-required will never be local-capable.
Depends on what you mean by "economically feasible".

Even very cheap mini-PCs and laptops can run any of the models run by cloud providers, albeit at a much lower speed (i.e. with the weights stored on SSDs).

Whether such a low speed is useful depends on the application. For something like a coding assistant or bug scanning, an instant response is desirable, but certainly not necessary.

The SSD would wear out in days while the laptop generates two responses a day. This is like saying you could power your home with AA batteries: yes, technically you could, but in practice it's entirely infeasible.
There is no wear on the SSDs, because the weights are only read; nothing is written to them during inference.

For model training, the requirements are very different, and the training of a big LLM cannot be done with home equipment. On the other hand, inference can be done on almost any PC, even for LLMs with thousands of billions of parameters, just very slowly.

The only problem is that inference becomes limited by SSD read throughput. Most of the cheap new personal computers available today can read from only 2 SSDs simultaneously (if there are more, they share a read path), typically 1 PCIe 5.0 SSD and 1 PCIe 4.0 SSD. This gives an upper throughput limit of 24 GB/s, with 15 to 20 GB/s achievable in practice.
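For anyone who wants to check the 24 GB/s figure, here is a rough back-of-the-envelope sketch. It assumes each SSD sits on an x4 link (typical for M.2 slots, but not stated above) and counts only line encoding, not protocol overhead, which is part of why real-world numbers land lower.

```python
# Rough ceiling for one PCIe 5.0 SSD plus one PCIe 4.0 SSD read in parallel.
# Assumption: each SSD uses an x4 link; protocol overhead is ignored.

LANES_PER_SSD = 4        # assumed x4 link per SSD
ENCODING = 128 / 130     # 128b/130b line encoding used by PCIe 3.0 through 5.0


def link_gbytes_per_s(gigatransfers_per_s: float, lanes: int = LANES_PER_SSD) -> float:
    """Raw link bandwidth in GB/s after line encoding."""
    bits_per_s = gigatransfers_per_s * 1e9 * ENCODING * lanes
    return bits_per_s / 8 / 1e9


gen5 = link_gbytes_per_s(32.0)   # PCIe 5.0 runs at 32 GT/s per lane
gen4 = link_gbytes_per_s(16.0)   # PCIe 4.0 runs at 16 GT/s per lane
combined = gen5 + gen4

print(f"PCIe 5.0 x4: {gen5:.2f} GB/s")
print(f"PCIe 4.0 x4: {gen4:.2f} GB/s")
print(f"combined:    {combined:.2f} GB/s")
```

The combined figure comes out just under 24 GB/s, in line with the limit quoted above.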

Then the speed in tokens/s is limited by the amount of weight data that must be read per inference cycle. The ratio between output tokens and the weight data read can be improved by various methods, like batching multiple tasks or using speculative decoding.
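A minimal sketch of that bound, in case it helps. The function name and the 100 GB per-pass figure are made up for illustration, and it assumes every batched request reuses the same weights read once per pass (roughly true for a dense model, less so for a mixture-of-experts model where different requests hit different experts).

```python
def tokens_per_second(read_gb_per_s: float,
                      weights_read_gb: float,
                      tokens_per_pass: float = 1.0) -> float:
    """Upper bound on output speed when weights are streamed from SSD.

    read_gb_per_s   : sustained SSD read throughput, in GB/s
    weights_read_gb : weight data read per inference pass, in GB
    tokens_per_pass : output tokens produced per pass (batch size, or
                      tokens accepted per pass with speculative decoding)
    """
    if read_gb_per_s <= 0 or weights_read_gb <= 0 or tokens_per_pass <= 0:
        raise ValueError("all arguments must be positive")
    passes_per_second = read_gb_per_s / weights_read_gb
    return passes_per_second * tokens_per_pass


# Hypothetical model that reads 100 GB of weights per pass.
for bandwidth in (15.0, 20.0):
    single = tokens_per_second(bandwidth, 100.0)
    batched = tokens_per_second(bandwidth, 100.0, tokens_per_pass=8)
    print(f"{bandwidth:.0f} GB/s: {single:.2f} tok/s alone, {batched:.2f} tok/s with 8 tokens per pass")
```

This ignores compute time and KV-cache traffic, so it is only a ceiling, not a prediction.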

Does more RAM increase performance? This approach sounds like it could eventually be fast enough for local use as hardware and models improve.
Faster SSD access improves performance more than RAM does, at least until all of the model is being cached in RAM. So older and cheaper HEDT platforms with lots of PCIe lanes to attach storage to are best for this approach.
Weights are write-once data.
It can run open-weight models that are roughly as capable. It's going to be slow unless you're using actual datacenter hardware, but they'll run.
"roughly" is doing a lot of heavy lifting there
The difference between datacenter hardware and cheap personal hardware is not in what can be run and what cannot be run.

Anything can also be run on a cheap computer.

The difference is in speed. A cheap computer may run a big model up to a few orders of magnitude slower than datacenter hardware, depending on whether the LLM is small enough to fit in GPU memory, small enough to fit in CPU memory, or so big that it must spill onto SSDs.
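As a toy illustration of those three cases, something like the following picks the fastest tier that can hold the whole model. The function name and the machine sizes are hypothetical, and it deliberately ignores KV cache, activations, and splitting a model across tiers.

```python
def placement_tier(model_gb: float, gpu_mem_gb: float, cpu_mem_gb: float) -> str:
    """Return the fastest memory tier that can hold all of the weights.

    Simplification: the model is treated as living entirely in one tier.
    """
    if model_gb <= gpu_mem_gb:
        return "gpu"   # fastest: weights stay in GPU memory
    if model_gb <= cpu_mem_gb:
        return "cpu"   # slower: weights stay in system RAM
    return "ssd"       # slowest: weights are streamed from SSD on every pass


# Hypothetical desktop with a 24 GB GPU and 64 GB of RAM.
for size_gb in (16, 48, 5000):
    print(f"{size_gb} GB model -> {placement_tier(size_gb, 24, 64)}")
```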

Depending on the application, the tradeoff between run time and run cost may happen to favor using local hardware, despite a much slower speed.

There are plenty of applications where running them for negligible cost as an overnight job can be preferable to obtaining faster results at a very high price, for instance scanning for bugs in a mature code base using a great number of different open-weights LLMs, which can achieve bug coverage similar to that of a single, but overpriced and unavailable, SOTA LLM, e.g. Mythos.

> The difference between datacenter hardware and cheap personal hardware is not in what can be run and what cannot be run.

You do realize that a model like Opus is (estimated to be) around 5T parameters, and uses around 5TB of GPU memory?

These kinds of things are just impossible to run locally.

These kinds of things can certainly be run locally, even on a small mini-PC, like a NUC, or even on a laptop, with the weights stored on SSDs.

As I have said, the problem is not that they cannot be run, but that they may run more slowly than is acceptable for a given application. Depending on the model, the speeds reported for inference with weights stored on SSDs vary from one token every few seconds to at most a few tokens per second.
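To put those speeds in perspective for an unattended batch run, here is a quick calculation. The 8-hour window and the two sample rates (0.3 and 3 tokens/s, standing in for "one token every few seconds" and "a few tokens per second") are my own illustrative picks, not measurements.

```python
def tokens_in_window(tokens_per_s: float, hours: float) -> int:
    """Total tokens generated at a steady rate over a time window."""
    if tokens_per_s < 0 or hours < 0:
        raise ValueError("rate and duration must be non-negative")
    return round(tokens_per_s * hours * 3600)


# Assumed 8-hour unattended run at the two ends of the reported range.
for rate in (0.3, 3.0):
    print(f"{rate} tok/s for 8 h -> {tokens_in_window(rate, 8)} tokens")
```

Whether that many tokens per night is enough depends entirely on the job.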

Computers could solve relatively huge problems even in the early days of vacuum tube computers, when main memories were measured in kilobytes, because at that time nobody expected the data needed for a problem to fit inside main memory, or even in the next tier of memory (magnetic drums or magnetic disks). The really big problems were solved by making a great number of passes over data stored on magnetic tapes.

An LLM whose inference could not be run on a small mini-PC would have to be one hundred times bigger than the biggest existing SOTA LLMs.

Any LLM that exists today can be run on almost any PC, just extremely slowly in comparison with datacenter hardware.

When people say that you "can't do" something, what they actually mean is that it's completely impractical (if not impossible).
NEVER will be is a pretty big leap. Never is a long time.