Interesting to think about where the cost will go in a few years.
I remember my college intro to CS class back in 1998, where I heard the story of building the first computer that could perform at 1 TFLOPS[1]. It cost $46 million and took up 1600 square feet. Now a $600 Mac Mini will do double that.
It is not going to go down much anymore, because the end of Moore's law has been reached as physical limitations become a factor. You cannot scale transistors down to widths close to 1 atom.
Moore's law isn't dead. Only Dennard's law. See slide 13 here[0]. Moore's law stated that the number of transistors per area will double every n months. That's still happening. Besides, neither Moore's law nor Dennard scaling is even the most critical scaling law to be concerned about...
...that's probably Koomey's law[1], which looks well on track to hold for the rest of our careers. But eventually, as computing approaches the Landauer limit[2], it must asymptotically level off as well, probably starting around the year 2050. Then we'll need to actually start "doing more with less" and minimizing the number of computations done for specific tasks. That will begin a very, very productive time for highly task-specialized custom silicon and for low-level algorithmic optimization.
[0] Shows that Moore's law (green line) is expected to start leveling off soon, but it has not yet slowed down. It also shows Koomey's law (orange line) holding indefinitely. Fun fact, if Koomey's law holds, we'll have exaflop power in <20W in about 20 years. That's equivalent to a whole OpenAI/DeepMind-worth of power in every smartphone.
Also even MHz increases have had a bit of a comeback lately, with the fastest mid-2000's Pentium 4's reaching 3.8-4.2GHz and the latest Ryzen 7000's reaching 6GHz.
I’ll take this 10 year bet. You really think Nvidia is just gonna stop releasing new revisions? “Moore's law is dead” is way over-memed; at this point it’s more of an axiom about how computers continually improve than really being about transistor count.
Moore's law and, more importantly, Dennard scaling both died in the mid 2000s. Nvidia is in fact successful because of the end of Dennard scaling, and the shift to more mission-specialized silicon like TPUs, codec accelerators, and inference engines is also a consequence of that.
Nvidia's performance gains in recent years have been more about scaling chip size and making more efficient use of each transistor, both in terms of power and count, than anything else. A large part of that is minimizing how far data physically moves for any given workload via stuff like HBM, memory compression, and smarter/larger caches.
In fact, Nvidia doesn't even really try to be on the bleeding edge nodes anymore, because per-transistor costs have been trending up or level on bleeding edge nodes for at least 5 years now.
Hi, one of the authors of the post here. We will update the post with numbers from the 256 GPU run within the next few days. We estimated the 256 GPU run to be the fastest (13 days), but also the most expensive at $160k. The measured 128 GPU run would take 21 days but cost $125k, if you are interested in lower costs.
That is not necessarily a saving. If you have, let's say, a team of five people each costing $1000 a day, those 6 idle days (not counting the weekend) would add up to $30k of wasted money. And if you are working on something the competition is also working on, those lost days could cost you your edge, which could be quite expensive or even cost you the business.
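A minimal sketch of that trade-off, using the run figures quoted above (13 days at $160k versus 21 days at $125k) and the example team of five at $1000 per person per day. The function name and the two-weekend-day default are assumptions made for illustration, not anything from the post.

```python
def idle_team_cost(slow_days, fast_days, team_size, daily_rate_usd, weekend_days=2):
    """Cost of paying a team for the extra working days spent waiting on the slower run.

    weekend_days is an assumption: the number of non-working days that fall
    inside the extra waiting period.
    """
    idle_working_days = max((slow_days - fast_days) - weekend_days, 0)
    return idle_working_days * team_size * daily_rate_usd


# Figures from the thread: 256 GPUs -> 13 days, $160k; 128 GPUs -> 21 days, $125k.
compute_saving = 160_000 - 125_000
waiting_cost = idle_team_cost(21, 13, team_size=5, daily_rate_usd=1000)
net_saving = compute_saving - waiting_cost

print(f"compute saving: ${compute_saving:,}")
print(f"idle team cost: ${waiting_cost:,}")
print(f"net saving:     ${net_saving:,}")
```

With these particular numbers the slower run still comes out slightly ahead on paper, so the balance tips as soon as the team is larger or pricier, and that is before putting any price on lost time against competitors.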
Still pretty pricey for the average person, but these costs will trend cheaper, which is why I think it's futile to "regulate" AI. Someone somewhere will train models on anything visible to the public, licensed or not. It feels like Pandora's box has been opened and we need to deal with it.
The companies that wait for a 100% "clean" model are going to get left behind. e.g. ChatGPT launching despite Google, Meta and others already having very similar technology internally
There is already a class action lawsuit. The companies that move forward with "dirty" models can be wiped out by legal fees before they get off the ground.
This will likely just entrench the companies with pockets deep enough to satisfy the lawyers in the class action suits. It might burn down a startup. On the other hand, if Microsoft thinks it's a potential $500 billion business, a $1 billion settlement is just table stakes.
That likely wouldn't be on the table in a settlement offer, and it might be a tough sell for the plaintiff class to get nothing or a protracted legal battle instead of an easy and significant payout. Anything is possible, but I don't think a lawsuit's likely outcome in this situation would be a scorched earth fight.
Has this ever happened in practice to a well-funded company, outside the Napster case? I am skeptical that training on publicly accessible data can be ruled illegal. There would be too many side effects, including making Google Search problematic.
As usual, AI has no agency. I believe we should view AI as simply an extension of our own agency. Thus, if you prompt an AI to generate copyrighted work, that's fine. Viewing it yourself is like imagination. However, just as you can draw Mickey Mouse for your own fun all you want, you cannot sell such images.
That’s kind of already the case. Ignoring the legality of GitHub Copilot itself, if it suggests code that violates copyright and you use it, you are (probably most likely[a]) still infringing. You can’t hide behind “the AI did it.”
5 bucks says within a year there’ll be some innovation that shrinks this by 2 orders of magnitude. Either from much cheaper compute cost (eg OPUs) or much more efficient training. Hell, there ought to be some way to leapfrog these innovations in such a way that the huge model of yesteryear becomes a more powerful optimizer/loss function itself. That’d just about solve the “hands off my unique shapes!” problem of acceptable training data trawling too :)
Note that this doesn't take into account the numerous iterations required to dial in the correct hyperparameters and model architecture, which could easily increase cost 5-10x.
> 256 A100 throughput was extrapolated using the other throughput measurements
Is it an indictment of their service that they couldn't afford 256 GPUs on their own cloud?
It's an indictment of the A100 node that died on us yesterday, leaving us with 248 GPUs in the particular cluster where we were running the experiments :(
It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage these hardware failures and allow runs to continue anyway.
Unfortunately, it would mess up our throughput calculations for getting clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.
Getting reliable GPUs is a difficult problem, I empathize. I've spent a decent amount of time and money because there was one failing GPU on an AWS cluster.
We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by N nodes and N-1 nodes. Fault tolerant system design is unfortunately an evergreen topic in CS.
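One way to read the "divisible by N nodes and N-1 nodes" idea is as a search for global batch sizes that split evenly whether or not one node has dropped out. The sketch below is only an illustration of that arithmetic; the function name and the example node count are hypothetical and do not come from the authors' tooling.

```python
from math import gcd


def fault_tolerant_batch_sizes(n_nodes, max_batch):
    """Global batch sizes up to max_batch that divide evenly across
    both n_nodes and n_nodes - 1 (i.e. after losing one node)."""
    if n_nodes < 2:
        raise ValueError("need at least 2 nodes to tolerate losing one")
    # Least common multiple of N and N-1. Consecutive integers are coprime,
    # so this is simply N * (N - 1).
    step = n_nodes * (n_nodes - 1) // gcd(n_nodes, n_nodes - 1)
    return list(range(step, max_batch + 1, step))


# Hypothetical example: a 32-node cluster, batch sizes up to 4096.
candidates = fault_tolerant_batch_sizes(32, 4096)
print(candidates)
```

Because the step grows roughly as N squared, usable batch sizes get sparse quickly on larger clusters, which fits the "where possible" caveat.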
Data truly is the new oil. When it’s all done the compute costs and code will be cheap or free. There’s a lot hinging on how we interpret copyright laws or what kind of data rights laws we enact.
This task requires a bit more work than I'd want, but I'd also point out $100k can buy ~9 A100's which are good for ~7k hours of work a month (through not entirely reputable channels, so there's a chance some might die earlier or might have to be returned). That might not train Stable Diffusion in a fast enough time for you (~50k hours estimated training time), but it's still damned impressive. And you can keep the hardware.
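As a rough check on those figures, here is a back-of-the-envelope sketch. It assumes a 30-day month, 100% utilization, and perfect scaling across cards, none of which would hold exactly in practice; the function names are made up for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, GPUs busy around the clock


def gpu_hours_per_month(n_gpus):
    """GPU-hours delivered per month by n_gpus running nonstop."""
    return n_gpus * HOURS_PER_MONTH


def months_to_finish(total_gpu_hours, n_gpus):
    """Months needed to accumulate total_gpu_hours, assuming perfect scaling."""
    return total_gpu_hours / gpu_hours_per_month(n_gpus)


monthly = gpu_hours_per_month(9)          # the ~7k hours a month from the comment
months = months_to_finish(50_000, 9)      # the ~50k hour training estimate
print(f"{monthly} GPU-hours per month, about {months:.1f} months for 50k hours")
```

Under these assumptions nine cards would need somewhere between seven and eight months for a ~50k GPU-hour job.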
I wonder if AMD is as over-the-top brutal with legal control over where their GPUs can be used as Nvidia is. Maybe with energy cost you might possibly still want to stick with the A100's anyways, but you can afford quite a lot of RX 7900's with $100k (if you can find em).
It's interesting to compare the cost for cloud GPUs vs. buying the hardware outright. At ~$10,000 per Nvidia A100 GPU, it seems like this cloud provider would break even on the hardware after about 5 months at these rates. There are certainly other costs involved (racking, power, etc.), but that's not too bad. I'm almost surprised Nvidia doesn't cannibalize its hardware sales by running its own cloud.
This is a very, very rough estimate using inference benchmarks, which can't necessarily be extrapolated to training, but if an A100 takes 6.49 seconds to generate an image and an EPYC 7352 24-core CPU takes 223.19 seconds[0], the CPU is about 34 times slower.
So you would need at least 2,716,796 hours to train on CPU.
An m6a.12xlarge is roughly equivalent to an EPYC 7352 24-core[1], and it currently costs $0.5028 an hour on spot.
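The arithmetic in this estimate can be sketched as follows. It only multiplies the figures quoted above, and it assumes that one instance-hour covers one of the estimated CPU-hours and that the spot price stays constant for the whole run; both are simplifications, and the variable names are illustrative.

```python
# Figures quoted in the comment above.
A100_SECONDS_PER_IMAGE = 6.49
EPYC_SECONDS_PER_IMAGE = 223.19
CPU_TRAINING_HOURS = 2_716_796       # the commenter's CPU-hours estimate
SPOT_USD_PER_HOUR = 0.5028           # m6a.12xlarge spot price quoted above

# Inference slowdown of the CPU relative to the A100.
slowdown = EPYC_SECONDS_PER_IMAGE / A100_SECONDS_PER_IMAGE

# Assumption: one instance-hour per estimated CPU-hour, at a constant spot price.
cpu_cost_usd = CPU_TRAINING_HOURS * SPOT_USD_PER_HOUR

print(f"CPU is about {slowdown:.1f}x slower than the A100")
print(f"Estimated spot cost on CPU: ${cpu_cost_usd:,.0f}")
```

On these numbers the CPU route lands somewhere above a million dollars, with the same caveat as the original estimate that inference benchmarks may not carry over to training.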
[1] https://en.wikipedia.org/wiki/ASCI_Red