Why is Chat GPT so expensive to operate? | HN Mirror


	Why is Chat GPT so expensive to operate?
	82 points by beavis000 1248 days ago
	Altman has said "it's a few cents per chat", which probably means it's closer to high single-digit cents per chat. Does that estimate include amortization of upfront development costs, or is it actually the marginal cost of a chat?

13 comments

vineyardmike 1248 days ago

All these answers are good, but I can share more concrete numbers…

Meta released their OPT model, which they claim is comparable to GPT-3. Guidance for running that model [1] suggests a LOT of memory: at least 350GB of GPU memory, which is roughly 4 A100s, and those are pricey.

Running this on AWS with the above suggestion would cost $25/hr, just for one model running. That’s almost $0.50 a minute. If you imagine it takes a few seconds to run the model for one request, you’ll easily hit $0.05 per request once you factor in the rest of the infra (storage, CDN, etc.), the engineering cost, the research cost, and the fact that they probably have to scale to hundreds of instances for heavy traffic, which may mean less efficiently purchased servers.

OpenAI has a sweetheart deal with Azure, but this is roughly the cost structure for serving requests. And this doesn’t include the upfront cost of training.
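A rough sketch of the arithmetic in this comment, for anyone who wants to plug in their own numbers. The $25/hr figure is the one quoted above; the 3-second request time and the idea that an instance serves one request at a time are assumptions made only for illustration.

```python
# Back-of-envelope serving cost using the figures quoted in the comment.
# SECONDS_PER_REQUEST and "one request at a time" are illustrative assumptions.

def cost_per_request(hourly_rate_usd: float, seconds_per_request: float) -> float:
    """Raw compute cost of one request on an instance serving one request at a time."""
    return hourly_rate_usd / 3600.0 * seconds_per_request

HOURLY_RATE_USD = 25.0       # AWS estimate from the comment
SECONDS_PER_REQUEST = 3.0    # assumption: "a few seconds" per request

per_minute = HOURLY_RATE_USD / 60.0
per_request = cost_per_request(HOURLY_RATE_USD, SECONDS_PER_REQUEST)

print(f"${per_minute:.2f} per minute")     # about $0.42
print(f"${per_request:.3f} per request")   # about $0.021 before any overhead
```

Raw compute alone lands at roughly two cents under these assumptions, so the $0.05 figure depends on how much the remaining infrastructure, staffing, and idle capacity add on top.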

[1] https://alpa.ai/tutorials/opt_serving.html

mr_00ff00 1248 days ago

Really makes you appreciate the brain, which presumably operates with some sort of similar demand.

unsupp0rted 1248 days ago

Hard to tell. Similar to how it takes a lot of resources for a human to hang from monkey bars but for a sloth it takes basically no resources at all, because the sloth comes out of the box designed for it.

nocsi 1248 days ago

Human babies come out of the box designed for hanging from monkey bars as well.

https://youtu.be/jXJLaGguQiU

smnrchrds 1248 days ago

Another mind-boggling thing about the brain is how little power it uses to do all the complex things it does.

wordpad25 1248 days ago

calories are a unit of energy, so it’s a straightforward comparison

if we assume that a computer can be powered by 100 watts, over a day it will use 2.4 kWh, which is about 2000 Calories

GPU will consume a lot more, but we aren’t that far off in efficiency
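The conversion above can be checked in a few lines. This is only a sketch of the unit arithmetic; the 100 W figure is the commenter's assumption, and 1 food Calorie is taken as 1 kcal = 4184 J.

```python
# Convert a constant power draw into food Calories (kcal) per day.
JOULES_PER_KWH = 3.6e6
JOULES_PER_KCAL = 4184.0   # 1 food Calorie = 1 kcal

def daily_kcal(power_watts: float) -> float:
    """Energy used in 24 hours at a constant power draw, expressed in kcal."""
    kwh_per_day = power_watts * 24 / 1000
    return kwh_per_day * JOULES_PER_KWH / JOULES_PER_KCAL

print(round(daily_kcal(100)))  # 2065, i.e. "about 2000 Calories"
```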

pattrn 1248 days ago

Doesn't that assume 100% of a human's daily calorie burn is due to brain activity?

unsupp0rted 1248 days ago

The brain uses about 20% of a human's calories. It's not 100%, but it's a substantial fraction.
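Turning that 20% into watts gives a sense of scale. A minimal sketch, assuming the roughly 2000 Calories per day mentioned earlier in the thread as the whole-body budget:

```python
# Average power implied by a daily energy budget given in kcal.
JOULES_PER_KCAL = 4184.0
SECONDS_PER_DAY = 86_400

def kcal_per_day_to_watts(kcal_per_day: float) -> float:
    """Average power in watts for a given number of kcal spent per day."""
    return kcal_per_day * JOULES_PER_KCAL / SECONDS_PER_DAY

DAILY_KCAL = 2000        # assumption: whole-body daily intake
BRAIN_FRACTION = 0.20    # the 20% figure from the comment

brain_watts = kcal_per_day_to_watts(DAILY_KCAL * BRAIN_FRACTION)
print(f"{brain_watts:.1f} W")  # roughly 19.4 W
```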

sannee 1247 days ago

The other components of the human body are also required for brain function.

imtringued 1248 days ago

The brain doesn't use a synchronous digital architecture. It is asynchronous. Spiking neural networks implemented in neuromorphic hardware are equally efficient. They consume milliwatts for a million neurons.

awesomeMilou 1248 days ago

Do you have links on novel hardware architectures for neuromorphic hardware? In my country, the leading research group for neuromorphic computing does not cite any novel hardware approaches, only which existing hardware architectures are most suitable.

iosystem 1247 days ago

How do you know that the universe isn't just rendering everything?

_8j50 1248 days ago

To have ML produce meaningful content you need to give it some input or a sense of what the outcome should be, and this is after billions of trials and errors.

Yet people these days believe something like the brain was brute-forced by nature into an accidental existence.

ericathegreat 1248 days ago

Some input: The organism's environment.

Outcome should be: The organism successfully produces offspring

Natural selection is doing exactly what you describe.

_8j50 1248 days ago

Except natural selection can't start over. It only works if there is always a high rate of survivors. Even if that were not an issue, consider 4 billion years and a generous generation length of one year (one natural selection cycle): 4 billion generations isn't a whole lot, even for small features, when you don't have an enormous population and birth rate. Let's say there were 100,000 humans at some point and only 1,000 fatal features (being generous). It's not just that the replacement rate of defective humans needs to exceed the elimination rate; a certain percentage of replacements must also be free of all fatal defects and survive. Also, consider how there should be many failed species that attempted to evolve into a human-like species or a primate. You can't always luck out. At some point the entire branch has to fail, requiring subsequent attempts, and meanwhile the fatal conditions that required the evolution will not go away.

KingMob 1248 days ago

> It only works if there are always a high rate of survivors

There doesn't have to be a high rate of survival if the reproductive rate compensates for losses.

E.g., if 80% of wild rabbits are eaten, but the remaining 20% can give birth to 5 bunnies per parent per lifetime, the population will be stable.
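The rabbit example is easy to check numerically. A toy sketch, assuming non-overlapping generations and reading "5 bunnies per parent" as five offspring per surviving individual; real population models are far more involved.

```python
# Toy population model: each generation, a fraction survives and every
# survivor leaves a fixed number of offspring that form the next generation.

def next_generation(population: float, survival_rate: float,
                    offspring_per_survivor: float) -> float:
    """Size of the next generation under the toy model."""
    return population * survival_rate * offspring_per_survivor

population = 1000.0
for generation in range(10):
    population = next_generation(population, survival_rate=0.2,
                                 offspring_per_survivor=5)

print(population)  # stays at about 1000 because 0.2 * 5 = 1
```

With 4 offspring per survivor the same loop shrinks the population by 20% each generation, and with 6 it grows by 20%, which is the compensation effect the comment describes.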

I have no idea where you're getting your beliefs, but most of it is wrong in both the math and biology.

melagonster 1248 days ago

it is natural selection. this is the most famous mechanism of evolution.

ramblerman 1248 days ago

It's interesting that the requirements for a text model are so much greater than for images.

Stable Diffusion can run on a home PC, while it seems you need a supercomputer for GPT-3. I'm not sure that would have been my intuition.

sadpasture 1247 days ago

I think it has to do with text being much more precise. Your stably diffused cartoon avatar having 6 fingers is not nearly as noticeable as a language model's chat misspelling every second word. So you need fewer resources to get to a human-acceptable result.

andbberger 1247 days ago

no, diffusion models are just more efficient

mike_hearn 1248 days ago

Don't forget training costs, labor costs used for RLHF and (most likely) the money required for such large volumes of training data.

hansvm 1247 days ago

Doesn't ChatGPT fine-tune one of the smaller GPT-3s, not the 175B parameter model?

sdrg822 1248 days ago

For things like BERT where you just want to extract an embedding, the naive way you reach full utilization at inference time is that you:

- run tokenization of inputs on CPU

- sort inputs by length

- batch inputs of similar length and apply padding to make them uniform in length

- pass the batches through so a single model can process many inputs in parallel.
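The steps in the list above can be sketched in plain Python. This is a minimal illustration, not any library's API: `tokenize` is a toy stand-in for a real tokenizer, and `PAD_ID` and `make_batches` are hypothetical names.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id

def tokenize(text: str) -> List[int]:
    """Toy stand-in for a real tokenizer: one nonzero id per whitespace-separated word."""
    return [len(word) for word in text.split()]

def make_batches(texts: List[str], batch_size: int) -> List[List[List[int]]]:
    """Tokenize, sort by length, then pad each batch to its own longest input."""
    # Sorting first means each batch holds inputs of similar length,
    # so little compute is wasted on padding.
    token_lists = sorted((tokenize(t) for t in texts), key=len)
    batches = []
    for start in range(0, len(token_lists), batch_size):
        chunk = token_lists[start:start + batch_size]
        width = max(len(ids) for ids in chunk)
        batches.append([ids + [PAD_ID] * (width - len(ids)) for ids in chunk])
    return batches

batches = make_batches(["a bb ccc", "a", "a bb", "a bb ccc dddd"], batch_size=2)
for batch in batches:
    print(batch)
```

A real pipeline would also remember each input's original position so the outputs can be returned in request order; that bookkeeping is left out here.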

For GPT-style decoder models however, this becomes much more challenging because inference requires a forward pass for every token generated. (Stopping criteria also may differ but that’s another tangent).

Every generated token performs attention on every previous token, both the context (or “prompt”) and the previously generated tokens (important for self-consistency). This is a quadratic operation in the vanilla case.
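To see where the quadratic growth comes from, one can simply count how many earlier tokens get attended to during generation. A simplified sketch: it counts one unit of work per (new token, earlier token) pair and ignores layers, heads, and hidden size; `attention_reads` is a hypothetical name.

```python
def attention_reads(prompt_len: int, new_tokens: int) -> int:
    """Count (new token, earlier token) pairs scored while generating new_tokens tokens."""
    total = 0
    context = prompt_len
    for _ in range(new_tokens):
        total += context   # the new token attends to everything before it
        context += 1       # and then becomes part of the context itself
    return total

print(attention_reads(100, 100))  # 14950
print(attention_reads(100, 200))  # 39900
```

Doubling the number of generated tokens more than doubles the work, because the total is prompt_len * n + n * (n - 1) / 2 for n generated tokens.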

Model sizes are large, often spanning multiple machines, and the information for later layers depends on previous ones, meaning inference has to be pipelined.

The naive approach would be to have a single transaction processed exclusively by a single instance of the model. This is expensive! Even if each model can be crammed into a single A100, if you want to run something like Codex or ChatGPT for millions of users with low-latency inference, you’d have to have thousands of GPUs preloaded with models, and each transaction would take a highly variable amount of time.

If a model spans n machines, you’d achieve a max of 1/n utilization because each shard has to remain loaded while the others process. And if you want to do pipeline parallelism like in PipeDream, you’d have to deal with attention caches, since you don’t want to have to recompute every previous state each time.
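The 1/n figure falls out of a simple slot count. A sketch under idealized assumptions: every stage takes the same time, and requests flow through in a plain fill-and-drain schedule; `pipeline_utilization` is a hypothetical helper, not part of any framework.

```python
def pipeline_utilization(num_stages: int, num_in_flight: int) -> float:
    """Fraction of stage time slots doing useful work in a fill-and-drain pipeline."""
    busy_slots = num_stages * num_in_flight
    # The last item leaves the pipeline after (stages + items - 1) time steps.
    total_slots = num_stages * (num_stages + num_in_flight - 1)
    return busy_slots / total_slots

print(pipeline_utilization(4, 1))            # 0.25, i.e. 1/n with a single request
print(round(pipeline_utilization(4, 8), 3))  # 0.727 with 8 requests in flight
```

Keeping more requests in flight is what raises utilization, and that is where the attention caches come in: each in-flight request needs its own cached state on every shard.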

JoeyBananas 1248 days ago

> Does that estimate include amortization of upfront development costs?

The answer is almost certainly "no." A service like Chat GPT is expensive because it requires heavy-duty GPU computations.

Jensson 1248 days ago

The GPU required to run it (A100) is said to cost about $150k. If each query is said to cost about 3 cents, then that means the card could execute the model about 5 million times before it turns a profit. Maybe a bit more if we include the electricity bill, and even more if Microsoft charges extra for the service since they want to make a profit.

I don't think these numbers sound very out of line. It would be easier to understand the feasibility of this if we knew how fast those cards could execute the model. If it takes a second to run it then a few cents seems about right; if it takes a few milliseconds then it is a lot less than a few cents, unless Microsoft charges a huge premium for the servers.
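The break-even estimate, and how it depends on query speed, can be sketched as follows. The $150k and 3-cent figures come from the comment; the 5 ms case is a made-up stand-in for "a few milliseconds", and one query at a time per card is assumed.

```python
SECONDS_PER_DAY = 86_400

def queries_to_break_even(card_cost_usd: float, value_per_query_usd: float) -> float:
    """Number of queries needed for per-query value to cover the card's price."""
    return card_cost_usd / value_per_query_usd

def days_of_continuous_use(num_queries: float, seconds_per_query: float) -> float:
    """Days of back-to-back queries, one at a time, needed to serve num_queries."""
    return num_queries * seconds_per_query / SECONDS_PER_DAY

queries = queries_to_break_even(150_000, 0.03)
print(f"{queries:,.0f} queries")                                               # 5,000,000
print(f"{days_of_continuous_use(queries, 1.0):.1f} days at 1 s per query")     # 57.9
print(f"{days_of_continuous_use(queries, 0.005):.2f} days at 5 ms per query")  # 0.29
```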

trillic 1248 days ago

An 80GB A100 is not $150k; it's more like $10-15k.

machinekob 1248 days ago

But the model needs about 350GB, so I'm not sure one 80GB A100 will be enough.
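Putting the two corrections together gives a quick card count. A sketch that considers weight memory only, using the 350GB, 80GB, and $10-15k figures from this thread; activations and attention caches would need additional room.

```python
import math

def gpus_needed(model_gb: float, gpu_memory_gb: float) -> int:
    """Minimum number of cards whose combined memory holds the model weights."""
    return math.ceil(model_gb / gpu_memory_gb)

cards = gpus_needed(350, 80)
low_usd, high_usd = cards * 10_000, cards * 15_000

print(cards)                                   # 5
print(f"${low_usd:,} to ${high_usd:,} in cards")  # $50,000 to $75,000
```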

toomuchtodo 1248 days ago

https://twitter.com/tomgoldsteincs/status/160019698195510069...

https://threadreaderapp.com/thread/1600196981955100694.html

jhoelzel 1244 days ago

Because it's a language model and does not really query information but assumes relationships within it. Meaning that "the words" have to be encoded and brought into relation by text.

Now there are different ways to achieve this, but in essence it's because it has to know everything all at once, plus instructions on how to handle that.

You can actually ask it to explain to you how you could create a natural language processing algorithm yourself and it will even give you a starter framework in the language of your choice. But a fair warning, for me it was like a 6 hour deep rabbit hole :D

sinenomine 1248 days ago

The model is large and every instance likely (not sure about the absolute degree they optimized the model) requires several GPUs (or high-grade accelerators) to run at a moderate speed.

Read the papers.

ilaksh 1248 days ago

Apparently each query requires hundreds of GBs of GPU RAM on several expensive accelerator cards.

Is the H100 deployed at Azure? I wonder how much more efficient that would be over A100s.

jejeyyy77 1248 days ago

Which seems insane considering stable diffusion can run on a M1 MacBook.

ilaksh 1247 days ago

Sure but they are totally different algorithms doing different things.

jejeyyy77 1236 days ago

Well yeah

lee101 1248 days ago

Basically GPU/compute costs are so expensive. That is probably just the cost of the chat itself. Also, a whole boatload of development costs will eventually be passed on to consumers. For a cheaper alternative try https://text-generator.io, which also analyses images, something OpenAI doesn't do.

scarface74 1248 days ago

My question is how long will it be before the average high end computer can run it? How long before your average smart phone?

Memory shipped with computers has been stagnant for a decade.

faebi 1248 days ago

Maybe that will be the next use case to make larger amounts of memory mainstream. At the same time, somehow Tesla still manages to cram more and more neural nets into that small memory. So it could also be that many neural nets are just not really efficient yet.

est31 1248 days ago

People are already trying to put cutting edge models onto consumer hardware: https://news.ycombinator.com/item?id=32678664

We live in a really exciting age :). Local AI models will also finally give Microsoft a reason again to raise hardware requirements for coming Windows versions. Right now they have to require obscure security chips and stuff, but in the future they might have some local Cortana thingy or something that requires a certain amount of computational power.

thealch3m1st 1248 days ago

Could models like ChatGPT run on hardware like the Tesla Dojo? If so maybe Elon should donate some...

sidibe 1248 days ago

Does Dojo even exist? He kept talking about how it was almost ready a couple of years ago, and there has been no word since, which is strange from such a braggart.

cypress66 1248 days ago

They unveiled it on AI day 2021, talked more about it on AI day 2022, and in theory should start operating Q1 2023.

thealch3m1st 1248 days ago

It would be a good idea, no?

DoesntMatter22 1248 days ago

I think it's actually quite cheap for what it is

MuffinFlavored 1248 days ago

what useful purpose have you found for ChatGPT given the “it can return inaccurate results posed as accurate” problem?

DoesntMatter22 1246 days ago

It does coding things extremely well. Are there some errors here or there at times? Yes but in general it does it excellently. I think this is a good example of not letting the perfect be the enemy of the good.

It will write 200 lines of code for me which would maybe take me a few hours. I have to spend 15 minutes cleaning it up, but still it saved me 80% of the time. It's a massive win.

Also great for writing articles or emails. I write what I want to say into ChatGPT and tell it to rewrite it to be pleasant and less harsh, and it does a great job of that.

elbear 1248 days ago

Have it do stuff you know how to do, just a lot faster. Or, even if you don't know exactly how to do it, check what it gave you to see if it produces expected results.

For example, it gives you code. You run that code to see if the outputs are as expected.

thunderrabbit 1248 days ago

Yes it works well for that! I used ChatGPT recently to write a quick code snippet that turned out better than what I found on SO or could have written myself 50X slower.

https://www.robnugen.com/journal/2023/01/14/chatgpt-helped-m...

z3r0k00l 1248 days ago

python

sinenomine 1248 days ago

If you really measure what is being run, it is more likely well-optimized CUDA GPU assembly kernels - or, at this point - might be already some exotic TPU-like accelerator assembly.

This hubris over the top-level language in the system is so passe, so 2000s.

mulligan 1248 days ago

at this scale, the ML models are usually compiled into a format that runs independently of Python. so the answer isn't "python"