Hacker News new | ask | show | jobs
by sathackr 1 day ago
The opposite of that has been happening for 20 years now with cloud compute.

It won't happen with AI models either.

It's almost ingrained in the American business model now. Outsource everything. Nobody wants to manage a room full of servers when they can spend 2-3x as much and outsource that headache along with the responsibility for it.

Same will happen with AI. Whether that means paying Anthropic that premium or paying AWS.

I'm in a relatively small business, we recently had an outage related to our local infrastructure.

I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Everyone wants to shuck the chore and the responsibility.

14 comments

> The opposite of that has been happening for 20 years now with cloud compute. It won't happen with AI models either.

AI is different.

Cloud computing genuinely is cheaper on average. It's better than paying for cisco servers, and at scale, it's cheaper than managed platforms (ala Heroku), and it's a coin toss for when you're in the middle ground and constantly approaching the point of rebuilding poor-man versions of existing products but with very very expensive engineering salaries.

In contrast, local models offer dramatic savings, and are magnitude of orders better in certain aspects: like stability - the performance is all over the place with traditional AI companies as they divert compute to their next big thing.

The benefits to maintaining your own infrastructure are pretty moderate to low, with very high risk.

And also, alternate models are pretty easy to use and easy to swap out unlike the vendor lock-in that exists with cloud services.

> AI is different.

I agree. The other thing here is that, once you can run LLMs on a single piece of commodity hardware (whether that includes one GPU or several), the difference between cloud vs. on-premise LLMs will largely be about where your hardware is located. There will be very little software configuration involved (just an HTTP endpoint that talks to the GPU). This is decidedly different from cloud products where the moat of hyperscalers is largely in the software and services on top of the hardware, not the hardware itself. (Sure, GPUs will eventually break & need replacement, too, but there's no state to lose, so that's already orders of magnitude easier than replacing hard drives.)

There's also a difference in the cost of downtime. A server hosting your website or SaaS, if it's down for five minutes, costs you a lot of real revenue. So you plan for redundancy, you set up automatic failover so that if one node goes down the next node can handle the load while the first one reboots, and so on. But for the LLM that's just serving your local model? You can tell everyone "Hey, we're taking it down for a 15-minute window, so plan your lunch break while it's down". Unplanned downtime can interrupt what people were doing and cost you productivity and thus money, but it's a lot easier to schedule planned downtime and have people work on non-model-using tasks during those periods: the model is helpful, but not essential.
> Cloud computing genuinely is cheaper on average.

For some applications, sure. Availability is a large part of what one is paying for with cloud computing, but it's also something that not every business needs.

If you sacrifice availability and have a pure-compute use case (low durability requirements), on-prem can quickly end up cheaper for far better hardware.

There's no economic reason why running a model locally should be better than using a cloud hosted version.
“There is no reason anyone would want a computer in their home." - Ken Olson, Founder of Digital Equipment Corporation, in 1977
In hindsight this is getting truer, what with the push of dumb terminal for everyone
You pay a 3x markup to rent a server through AWS than managing your own. You pay for convenience. At shall annals that's fine, but for large companies with their own datacenters, you generally do things in house.
Sure there is. Keeping your IP in house.
AI is different because you can't encrypt it. An running on someone else's hardware is basically just 'trust me bro! I won't read it!'. Of course you can say that about I.e. database too, but at least you can run it on your own dedicated hardware in some datacenter, so it is password protected, you can encrypt it at rest and you will only know the key.

With AI, no, you can't . model needs plain text to be able to work. If somebody will be able to figure out models with asymmetric keys will make a lot of money.

For many companies (country-dependent) that's not really why they use cloud services vs purchasing. It's tax shenanigans and business process overhead. OpEx vs CapEx, and a small (%) bump in the huge AWS bill no one will even notice or a $30k+ invoice for hardware that has to go through rigorous review and 3 departments.

Same reason people pay for things through the AWS marketplace (like Vanta) instead of having to go through their invoicing process.

Good point. Maybe there'll be companies that maintain your on-premise GPU cluster just like there are companies that service the coffee machine in your office?
This is far more likely than everyone racking their own servers.
> on-premise GPU cluster

Renting a GPU server from a cloud and hosting your own llama.cpp is the path of least resistance.

It's just not comparable though is it? You need cloud services because it's physically impossible to use your single home computer as a server, CDN, load balancer, mass storage, security service, and distributed system.

But AI is just weights, you can run a reasonably intelligent model at home, or on a few GPUs if you're a small-medium sized company, and it doesn't require dedicated maintenance.

If you're a medium-large company, you should definitely run your own AI because you can max out the CPUs more often. You're not only able to run privately and locally, but you're also able to run efficiently.
> I got pressure from the CEO saying it wasn't reliable to host our own infrastructure anymore even though our total internal down time over the last 5 years is significantly less than even a single of the larger recent AWS outages.

Same here. My job as a software dev does not require me to self-host services we need and use. Quite the opposite. But, I am reluctant to hand over all control to AWS or equivalent for several reasons that I will get into here.

I have found that Infrastructure as Code (IaC) and modern tools like opentofu, ansible, combined with frontier AI models and harnesses gives you superpowers in this space. Almost all of our self-hosted services are fully managed by these tools. e.g. We perform backups and test them more often now than we ever did before. Entirely because it is so much easier to do all of that now.

> It won't happen with AI models either.

AI is definitely different. Cloud compute is incredibly convenient to the point where even if AWS is more expensive it's just so _nice_. LLM models are much more abstract and while I can't easily swap AWS for Hetzner to save 80% of my costs I can absolutely get close to that for many of LLM tasks, even today.

I suspect Anthropic and gang all know that that's why they are buying up dev tools and shifting towards long-running agents because that's where they can get AWS's "nicesness" that they can charge for.

There is efficiency in the cloud model for models. So maybe there is a scope for Apple or an "Apple for AI" in the AI compute game - mainly from the perspective of privacy etc.

And once the servers are in space, everything is fully out there.

IMO local-vs-cloud may be a misleading dichotomy, versus:

    1. Individual dev machines
    2. Shared local server
    3. Shared server in corporate cloud
    4. Third-party LLM SaaS provider
Even if you don't want your laptop melting, there are still some important differences between 3 and 4 in terms of data privacy and security.
I suppose cloud won because: - nobody wants to deal with the networking stack on the internet - you want servers alive all the time - it's businesses running their software on servers to serve to customers

Do these apply to AI?

on prem cloud is harder because of the scale up and scale down requirements. If you are a growing business which most decent ones are, you constantly have to think about that.
> Everyone wants to shuck the chore and the responsibility.

Which gives all the power to the big techs. I'll never understand why the average company seems to have no problem with this.

It's a longstanding management principle, so old that people may not even say it explicitly any more, which states "focus on your core competencies," the corollary of which is "outsource anything that is not a core competency."

I can see how it makes sense for companies, because money is "only money" but an ongoing operational distraction can be much more costly, as in, it can be detrimental to the success of the overall business.

Did you build your own house using tools that you forged from iron-rich ore yourself? Did you grow your own wheat to make bread for your lunchtime sandwich today?

There's a reason most people pay other people to do these things for them.

That's an interesting take, however there is no ongoing maintenance related to local models, maybe the only effort is giving more capable machines to the workforce; but yeah I can see how it might feel like a barrier.
The hardware, the power systems, the cooling systems. They need maintenance.

The OS needs updates, file systems get corrupted.

Fans get dirty.

All the things that you need to deal with in hosting your own server infrastructure you have to deal with when hosting your own AI infrastructure (which runs on servers...)

However, you can get many of the benefits of a "local model" by outsourcing all the hardware maintenance but still using an open model. Guaranteed repeatability for one.

A lot of the reason people outsource normal software is its brittle security properties, not sure that even applies to an LLM - it can go and look up the latest security best practices just like an engineer can.

Still though, perhaps the existence of low-margin, generic, cloud LLM's puts some downward pressure on the 'brand name' companies?
> in the American business model

AI company valuations won't survive if they're only for the "American business model".

Exactly. American businesses aren't even particularly efficient or well run
outsource that headache along with the responsibility for it

You know what gives me headaches? When I'm in the middle of a session and the model gets rug-pulled out from under me because somebody at the model provider didn't pay the Trump bill that month.

Or when someone at the model provider decides that the curve-fitting algorithm in my graphics package looks a little too much like Skynet for comfort.

Or when they do any number of other things to undermine my work for the sake of their business model, some of which I won't even notice until the damage is done.

The sad thing is, if you know how inference works, you know that it really is insanely wasteful for everybody to run it locally. If anything naturally belongs in the cloud, it's inference. But at the same time, what choice are we being given?

What about inference suggests it naturally belongs in the cloud?
Inference basically looks like this (neglecting a whole bunch of stuff):

    for t in tokens_in_context
        for p in model_weights
            do something with p*t
The expensive part is fetching each weight from memory, which is why VRAM/HBM is such a big deal. Conceptually, for a huge, dense (non-MoE) model, the inner loop might run a trillion times for every token generated.

Obviously that's not how it really works in practice, but the point is, if you are only running one prompt at a time, each weight gets fetched, applied to the token being processed, and then never touched again until the next token is processed.

So when you submit a prompt to a model that's running a bunch of other peoples' contexts concurrently, it can reuse each weight multiple times before moving on to the next one:

    for p in model_weights
        for u in users
           for t in u's context
              do something with p*t
The same is true in an agent-heavy scenario where you have several contexts in play at once.

Worst case, in terms of energy efficiency, is a single user sitting around waiting for a single response. I don't feel like I'm explaining it well, but the core idea is that every time a weight is fetched from memory, you want to get as much work done as possible with it.

That makes a lot of sense, thank you. I think a pirate cloud of local models could make sense, but that would be regulated into oblivion