Training Stable Diffusion from Scratch Costs <$160k

	Training Stable Diffusion from Scratch Costs <$160k (mosaicml.com)
	98 points by moinnadeem 1241 days ago

16 comments

mullingitover 1241 days ago

Interesting to think about where the cost will go in a few years.

I remember my college intro to CS class back in 1998, where I heard the story of building the first computer that could perform at 1 TFLOPS[1]. It cost $46 million and took up 1600 square feet. Now a $600 Mac Mini will do double that.

[1] https://en.wikipedia.org/wiki/ASCI_Red

DSingularity 1241 days ago

And on a much more programmable platform, which is just as important in terms of accessibility!

miohtama 1241 days ago

It is not going to go down much anymore, because the end of Moore's law has been reached as physical limitations become a factor. You cannot keep shrinking transistors once they are close to 1 atom wide.

runnerup 1238 days ago

Moore's law isn't dead. Only Dennard's law. See slide 13 here[0]. Moore's law stated that the number of transistors per area will double every n months. That's still happening. Besides, neither Moore's law nor Dennard scaling is even the most critical scaling law to be concerned about...

...that's probably Koomey's law[1], which looks well on track to hold for the rest of our careers. But eventually, as computing approaches the Landauer limit[2], it must asymptotically level off as well, probably starting around the year 2050. Then we'll need to actually start "doing more with less" and minimizing the number of computations done for specific tasks. That will begin a very, very productive time for highly task-specialized custom silicon and for low-level algorithmic optimization.

[0] Shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact, if Koomey's law holds, we'll have exaflop power in <20W in about 20 years. That's equivalent to a whole OpenAI/DeepMind-worth of power in every smartphone.

0: (Slide 13) https://www.sec.gov/Archives/edgar/data/937966/0001193125212...

1: "The constant rate of doubling of the number of computations per joule of energy dissipated" https://en.wikipedia.org/wiki/Koomey%27s_law

2: "The thermodynamic limit for the minimum amount of energy theoretically necessary to perform an irreversible single-bit operation." https://en.wikipedia.org/wiki/Landauer%27s_principle

runnerup 1238 days ago

Also even MHz increases have had a bit of a comeback lately, with the fastest mid-2000's Pentium 4's reaching 3.8-4.2GHz and the latest Ryzen 7000's reaching 6GHz.

Our_Benefactors 1240 days ago

I’ll take this 10 year bet. You really think Nvidia is just gonna stop releasing new revisions? “Moore's law is dead” is way over-memed; it’s more of an axiom about how computers continually improve than really being about transistor count at this point.

andromeduck 1240 days ago

Moore's law and, more importantly, Dennard scaling both died in the mid 2000s. Nvidia is in fact successful because of the end of Dennard scaling, and the shift to more mission-specialized silicon like TPUs, codec accelerators, and inference engines is also a consequence of that.

Nvidia's performance gains in recent years have been more about scaling chip size and making more efficient use of each transistor, both in terms of power and count, than anything else. A large part of that is minimizing how far data physically moves for any given workload via stuff like HBM, memory compression, and smarter/larger caches.

In fact, Nvidia doesn't even really try to be on the bleeding edge nodes anymore, because per-transistor costs have been trending up or level on bleeding edge nodes for at least 5 years now.

wokwokwok 1241 days ago

Is this just an ad for a service?

They didn’t make anything.

This is just speculative benchmarking.

I am deeply not interested in multiplying the numbers on your pricing sheet by the estimated numbers on the stable diffusion model card.

I have zero interest in your (certainly excellent) Proprietary Special Sauce (TM) that makes spending money on your service a good idea.

This just reads as spam that got past the spam filter.

Did you actually train a diffusion model?

Are you going to release the model file?

Where is the actual code someone could use to replicate your results?

Given the lack of example outputs, I guess not.

ml_hardware 1241 days ago

Did you actually read the blog? The very first sentence is:

> Try out our Stable Diffusion code here!
> https://github.com/mosaicml/diffusion-benchmark

abeppu 1241 days ago

> *256 A100 throughput was extrapolated using the other throughput measurements.

It seems worth noting that the $160k scenario wasn't actually measured.

landanjs 1241 days ago

Hi, one of the authors of the post here. We will update the post with numbers from the 256-GPU run within the next few days. We estimated the 256-GPU run to be the fastest (13 days), but also the most expensive at $160k. The measured 128-GPU run would take 21 days but cost $125k, if you are interested in lower costs.

varispeed 1241 days ago

That is not necessarily a saving. If you have, let's say, a team of five people each costing $1000 a day, those 6 idle days (not counting the weekend) would add up to $30k of wasted money. And if you are working on something the competition is also working on, those lost days could cost you your edge, which could be quite expensive or even cost you the business.
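
The trade-off in this comment can be sketched in a few lines. This is only an illustration: `total_cost` is a hypothetical helper, the training prices and durations are the ones quoted in the parent comment ($125k for 21 days on 128 GPUs, $160k for 13 days on 256 GPUs), and the team size, day rate, and 6 working days are this comment's example numbers.

```python
def total_cost(training_cost, extra_working_days, team_size=5, day_rate=1_000):
    """Training bill plus the cost of a team waiting on the slower run."""
    idle_cost = extra_working_days * team_size * day_rate
    return training_cost + idle_cost

# 128-GPU run: cheaper training, but the team waits 6 extra working days.
slow = total_cost(125_000, extra_working_days=6)
# 256-GPU run: pricier training, no extra waiting.
fast = total_cost(160_000, extra_working_days=0)

print(f"128 GPUs: ${slow:,}  256 GPUs: ${fast:,}")
```

With these particular inputs the $35k gap shrinks to $5k; a larger team or a higher day rate would reverse it.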

landanjs 1241 days ago

Agreed. When I mentioned lower costs, this was exclusive to model training, but there are so many other factors that influence the cost to a business.

epicycles33 1241 days ago

Glad to see this - you can even get reasonable-ish results on lower res images with ~2 hours train time on a P100 GPU. See my try here: https://www.kaggle.com/code/apapiu/train-latent-diffusion-in...

gedy 1241 days ago

Still pretty pricey for the average person, but these will trend cheaper, which is why I think it's futile to "regulate" AI. Someone somewhere will train models on anything visible to the public, licensed or not. Feels like Pandora's box has been opened and we need to deal with it.

adam_arthur 1241 days ago

The companies that wait for a 100% "clean" model are going to get left behind. e.g. ChatGPT launching despite Google, Meta and others already having very similar technology internally

sp332 1241 days ago

There is already a class action lawsuit. The companies that move forward with "dirty" models can be wiped out by legal fees before they get off the ground.

mullingitover 1241 days ago

This will likely just entrench the companies with pockets deep enough to satisfy the lawyers in the class action suits. It might burn down a startup. On the other hand, if Microsoft thinks it's a potential $500 billion business, a $1 billion settlement is just table stakes.

Sebguer 1241 days ago

This ignores that a legal remedy can be "you can no longer offer this as a product".

mullingitover 1241 days ago

That likely wouldn't be on the table in a settlement offer, and it might be a tough sell for the plaintiff class to get nothing or a protracted legal battle instead of an easy and significant payout. Anything is possible, but I don't think a lawsuit's likely outcome in this situation would be a scorched earth fight.

dfadsadsf 1241 days ago

Has this ever happened in practice to a well-funded company, outside the Napster case? I am skeptical that training on publicly accessible data can be ruled illegal. Too many side effects, including making Google Search problematic.

mensetmanusman 1241 days ago

That just means it will happen outside the US, wherever laws aren’t enforced (China, Russia, etc.)

operatingthetan 1241 days ago

Doesn't that kind of action cause organizations to be less transparent about the sources of their models then?

anon291 1241 days ago

As usual, AI has no agency. I believe we should view AI as simply an extension of our own agency. Thus, if you prompt an AI to generate copyrighted work, that's fine. Viewing it yourself is like imagination. However, just as you can draw Mickey Mouse for your own fun all you want, you cannot sell such images.

colejohnson66 1241 days ago

That’s kind of already the case. Ignoring the legality of GitHub Copilot itself, if it suggests code that violates copyright and you use it, you are (most likely[a]) still infringing. You can’t hide behind “the AI did it.”

[a]: Of course, IANAL, and it’s up to the courts

pizza 1241 days ago

5 bucks says within a year there’ll be some innovation that shrinks this by 2 orders of magnitude. Either from much cheaper compute cost (eg OPUs) or much more efficient training. Hell, there ought to be some way to leapfrog these innovations in such a way that the huge model of yesteryear becomes a more powerful optimizer/loss function itself. That’d just about solve the “hands off my unique shapes!” problem of acceptable training data trawling too :)

odyssey7 1241 days ago

How many tries does it take for an expert to succeed at training a custom Stable Diffusion?

ipsum2 1241 days ago

Note that this doesn't take into account the numerous iterations required to dial in the correct hyperparameters and model architecture, which could easily increase cost 5-10x.

> 256 A100 throughput was extrapolated using the other throughput measurements

Is it an indictment of their service that they couldn't afford 256 GPUs on their own cloud?

jfrankle 1241 days ago

It's an indictment of the A100 node that died on us yesterday, leaving us with 248 GPUs in the particular cluster where we were running the experiments :(

It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage these hardware failures and allow runs to continue anyway.

Unfortunately, it would mess up our throughput calculations for getting clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.

ipsum2 1241 days ago

Getting reliable GPUs is a difficult problem, I empathize. I've spent a decent amount of time and money because there was one failing GPU on an AWS cluster.

jfrankle 1241 days ago

We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by N nodes and N-1 nodes. Fault tolerant system design is unfortunately an evergreen topic in CS.
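
One way to read the "divisible by N nodes and N-1 nodes" idea: consecutive integers share no common factor, so any batch size that splits evenly across both N and N-1 nodes is a multiple of N × (N-1). A minimal sketch follows. `fault_tolerant_batch_sizes` is a hypothetical helper, and the 32-node example assumes 8 GPUs per node (256 GPUs, dropping to 248 after one node fails), which fits the numbers mentioned earlier in the thread but is not confirmed by the authors.

```python
import math

def fault_tolerant_batch_sizes(nodes, max_batch):
    """Global batch sizes that divide evenly across `nodes` and `nodes - 1` nodes."""
    # lcm(N, N-1) == N * (N-1) because consecutive integers are coprime.
    step = math.lcm(nodes, nodes - 1)  # math.lcm requires Python 3.9+
    return list(range(step, max_batch + 1, step))

# Assumed setup: 32 nodes, falling back to 31 if one node dies.
print(fault_tolerant_batch_sizes(32, 2048))  # [992, 1984]
```

The usable sizes get sparse quickly as N grows, which may be why the comment says "where possible".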

choxi 1241 days ago

Data truly is the new oil. When it’s all done the compute costs and code will be cheap or free. There’s a lot hinging on how we interpret copyright laws or what kind of data rights laws we enact.

ralph84 1241 days ago

The copyright on Mickey Mouse is due to expire next year, so there will definitely be some attempts at copyright "reform" this year.

rektide 1241 days ago

This task requires a bit more work than I'd want, but I'd also point out $100k can buy ~9 A100's which are good for ~7k hours of work a month (through not entirely reputable channels, so there's a chance some might die earlier or might have to be returned). That might not train Stable Diffusion in a fast enough time for you (~50k hours estimated training time), but it's still damned impressive. And you can keep the hardware.

I wonder if AMD is as over-the-top brutal with legal control over where their GPUs can be used as Nvidia is. Maybe with energy cost you might possibly still want to stick with the A100's anyways, but you can afford quite a lot of RX 7900's with $100k (if you can find em).
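
The back-of-the-envelope numbers in this comment can be checked with a short sketch. It assumes roughly 730 hours per month, perfect linear scaling across GPUs, and no downtime, none of which is guaranteed in practice. `gpu_hours_per_month` and `months_to_train` are hypothetical helpers; the 9 GPUs and ~50k GPU-hours are the comment's own estimates.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12

def gpu_hours_per_month(num_gpus, utilization=1.0):
    """GPU-hours a fixed pool of GPUs delivers in one month."""
    return num_gpus * HOURS_PER_MONTH * utilization

def months_to_train(total_gpu_hours, num_gpus, utilization=1.0):
    """Months needed to burn through a GPU-hour budget on owned hardware."""
    return total_gpu_hours / gpu_hours_per_month(num_gpus, utilization)

print(gpu_hours_per_month(9))                # 6570.0, i.e. the "~7k hours" above
print(round(months_to_train(50_000, 9), 1))  # 7.6
```

Lowering `utilization` below 1.0 models failed or returned cards and stretches the timeline accordingly.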

xnx 1240 days ago

It's interesting to compare the cost of cloud GPUs vs. buying the hardware outright. At ~$10,000 per Nvidia A100 GPU, it seems like this cloud provider would break even on the hardware after about 5 months at these rates. There are certainly other costs involved (racking, power, etc.), but that's not too bad. I'm almost surprised Nvidia doesn't cannibalize its hardware sales by running its own cloud.

coding123 1241 days ago

There are some large AWS customers that probably burn that in idle time on a bunch of unused machines per week (probably day).

a9h74j 1241 days ago

Can the training be parallelized in a manner similar to SETI-at-home?

nodja 1241 days ago

Yes, hivemind trained a gpt 6B model like this.

General model training https://github.com/learning-at-home/hivemind

Stable diffusion specific https://github.com/chavinlo/distributed-diffusion

Inference only stable diffusion https://stablehorde.net/

mensetmanusman 1241 days ago

This ignores all the runtime costs for LLMs that aren’t operating effectively :)

marcooliv 1241 days ago

Is there any value of this cost at which we can say "this is dangerous"? For any reason?

capableweb 1241 days ago

You could probably take any technology and come up with some sort of reasoning about why it is dangerous.

xwdv 1241 days ago

We can do it for way less using spot instances on AWS, though it takes longer.

WatchDog 1241 days ago

AWS don't have any GPU spot instances right?

I highly doubt you could get anywhere near the same price.

WatchDog 1241 days ago

Very, very rough estimate, using inference benchmarks, which can't necessarily be extrapolated to training: if an A100 takes 6.49 seconds to generate an image, and an EPYC 7352 24-core CPU takes 223.19 seconds[0], that's 34 times slower.

So you would need at least 2,716,796 hours to train on CPU.

An m6a.12xlarge is roughly equivalent to an EPYC 7352 24-core[1]; it currently costs $0.5028 an hour on spot.

So that works out to a cost of $1,366,005.

[0]: https://lambdalabs.com/blog/inference-benchmark-stable-diffu...

[1]: https://browser.geekbench.com/v5/cpu/compare/17529628?baseli...
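
The arithmetic in this estimate can be reproduced as follows. The sketch assumes a GPU training budget of 79,000 A100-hours; that figure is not stated in the comment, it is simply the value that reproduces the comment's 2,716,796 hours. The benchmark timings and the spot price are the comment's own, and using an inference slowdown as a proxy for training is, as the comment says, very rough.

```python
A100_SECONDS_PER_IMAGE = 6.49     # from the comment's inference benchmark
EPYC_SECONDS_PER_IMAGE = 223.19   # from the comment's inference benchmark
A100_TRAINING_HOURS = 79_000      # ASSUMPTION: back-derived, not stated in the comment
SPOT_PRICE_PER_HOUR = 0.5028      # m6a.12xlarge spot price quoted in the comment

# How many times slower the CPU is than the GPU per generated image.
slowdown = EPYC_SECONDS_PER_IMAGE / A100_SECONDS_PER_IMAGE
# Scale the GPU-hour budget by that slowdown to get CPU instance-hours.
cpu_hours = A100_TRAINING_HOURS * slowdown
cost = cpu_hours * SPOT_PRICE_PER_HOUR

print(f"slowdown:  {slowdown:.1f}x")        # 34.4x
print(f"CPU hours: {int(cpu_hours):,}")     # 2,716,796
print(f"cost:      ${cost:,.0f}")           # $1,366,005
```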

graphe 1241 days ago

I've downloaded anime models for free. I'm sure they were <$160 without the k. https://github.com/Noah670/stablediffusionAnime

O__________O 1241 days ago

Those models are not from scratch.

manimino 1241 days ago

It is a fair point though - there's no utility in training an openly available model from scratch. Finetuning is far more practical.

O__________O 1241 days ago

There are numerous reasons why someone might want to train a model from scratch; for example, copyright provenance and licensing control.
