I'm confused. The article from September 1 linked to here is strangely future-tense ("But the firm’s latest investment in 10,000 of the company’s H100 GPUs dwarfs the power of this supercomputer....This AI cluster, worth more than $300 million, will offer a peak performance...").
It links to a Tom's Hardware article (https://www.tomshardware.com/news/teslas-dollar300-million-a...) from August 28 that says "Tesla is about to flip the switch on its new AI cluster, featuring 10,000 Nvidia H100 compute GPUs" and "Tesla is set to launch its highly-anticipated supercomputer on Monday..." (presumably the September 1 event).
So, like, does Tesla actually have 10k H100s? Or do they have an order for 10k H100s? Or an intention to buy 10k H100s?
I don't mean to snipe, but this article doesn't seem to rise to the extremely high editorial standards of such tech-press luminaries as "TechRadar" and "Hacker News".
So another guy who claims to be a Tesla employee says (again, strangely future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.
But the use of future tense is a bit weird, right? And the lack of any followup?
> A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.
I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)
So, yes, I can sorta imagine Tesla pre-ordered 2% of global supply of H100s early in 2023 and was bragging about it at the end of August just 'cause.
But I can also imagine this is smoke and mirrors, and they have, like, a handful with the rest on backorder, and we haven't heard more about it 'cause Tesla doesn't have marketing people, it just has wahoos who post things on Twitter.
> So another guy who claims to be a Tesla employee says (again, strangely future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.
Another case of misuse? Here’s a tip for you: when you see a company logo/icon on someone's Twitter/X profile, that means they are verified to be affiliated with that org.
“Accounts affiliated with the organization will receive an affiliate badge on their profile with the organization’s logo, and will be featured on the organization’s Twitter profile, indicating their affiliation.”
> I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)
I linked a video where CNBC was interviewing Sawyer but it seems that you didn’t even bother to check it.
This seems to be the problem today. People refuse to do the bare minimum (which is not even much) required for critical thinking. Instead of verifying information, people tend to uncritically repeat inaccurate assumptions, even when provided with additional information in good faith.
Sure. I’m being a bit snarky. But I think the point stands that a single tweet from an employee saying “we’re about to do $thing” doesn’t exactly mean that, two months later, we should be reading a story whose sole origin is that tweet as evidence that $thing actually ended up happening.
Like, what's the actual news story here?
Totally agree there’s a lack of critical thinking at play.
Also, I think the X.com links only work if you have a login or something, fyi.
I understand that the H100 is NVidia's leading edge chip, but can someone let me know if 10K is considered to be a big cluster?
I've never worked inside one of the leading edge AI companies like OpenAI, Google, Microsoft or Meta.
Is this comparable to what they would work with?
My first guess is that it seems much smaller. And if you are running many parallel training jobs then you are getting about 1,000 chips at most to work with.
Or is this about what the leading competitors are working with?
Azure, for one, seems to have orders of magnitude more chips at their disposal.
10k H100 chips is considered a very large cluster. The third fastest supercomputer in the world is Microsoft’s Eagle, with 14k H100s: https://www.top500.org/lists/top500/2023/11/
Ah, gotcha, so it's the fact that it's 10,000 chips in one dedicated cluster that makes it large, as opposed to Azure, which has an order of magnitude more GPUs but rents many of those out.
High performance on a single task requires simultaneous computation and communication between nodes. If there's high latency between nodes, such as between nodes in different data centers, the communication costs can't be masked by computation.
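A rough way to picture the point about masking communication with computation is a toy step-time model. This is only a sketch under made-up assumptions: the function name `step_time`, the `overlap_fraction` parameter, and the timings are hypothetical and not measured on any real cluster.

```python
def step_time(compute_s, comm_s, overlap_fraction=1.0):
    """Toy model of one training step.

    Communication that fits inside the overlappable part of the compute
    is hidden; anything beyond that is added to the step time.
    """
    hideable = compute_s * overlap_fraction
    exposed_comm = max(0.0, comm_s - hideable)
    return compute_s + exposed_comm


# Low-latency interconnect: communication is fully hidden behind compute.
fast = step_time(compute_s=1.0, comm_s=0.2)

# High-latency link (e.g. between data centers): communication dominates.
slow = step_time(compute_s=1.0, comm_s=3.0)

print(fast, slow)
```

In this model the step gets slower as soon as communication takes longer than the compute available to hide it, which is why a single tightly connected cluster behaves differently from the same number of GPUs spread across sites.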
I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)
That said, these GPUs aren't just the GPUs. They are whole chassis. They are huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc...
ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted, not many people on the planet have run that much compute.
I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even same batch!), can produce wildly different performance results.
Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (rx470-rx580) and the rest of the hardware wasn't that advanced, like it is in AI, so none of it was transferred.
I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.
Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.
Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.
Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality to happen.
This is a big cluster, definitely large enough to pretrain 100B+ parameter LLMs in months. Source - I work at Databricks in the ML platform.
I don’t know much about AV processing, that’s highly customized to only a few customers but I’d expect it to also have very large computational requirements to do video processing and reinforcement learning.
Unfortunately no, but there are almost certainly clusters in the hands of private companies and government organizations that would prefer not to advertise their capabilities.
Last I heard, the estimate was that NVIDIA would build 550k units in 2023, so 2% of all production — especially as at least six others (your four plus Apple and at least one intelligence agency) will be of similar size by themselves — is certainly non-negligible.
Other than the ~12% I just estimated, lots of large-but-not-famous places will be buying ~1k, and small places will be buying tens to hundreds, and quite a lot of AI bubble money will be invested in startups that claim they only need one.
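The percentages above can be checked with a few lines of arithmetic. This is a sketch that takes the 550k production estimate and the "at least six others" count at face value; neither number is verified here, and the variable names are just for illustration.

```python
ESTIMATED_2023_UNITS = 550_000   # production estimate quoted above (unverified)
CLUSTER_SIZE = 10_000            # H100s in the reported Tesla cluster
SIMILAR_BUYERS = 6               # "at least six others" of similar size

# Share of the year's estimated production taken by one 10k cluster.
share_one = CLUSTER_SIZE / ESTIMATED_2023_UNITS

# Share taken by the six other buyers, and by all seven including Tesla.
share_six = SIMILAR_BUYERS * share_one
share_seven = (SIMILAR_BUYERS + 1) * share_one

print(f"one cluster: {share_one:.1%}")
print(f"six others:  {share_six:.1%}")
print(f"all seven:   {share_seven:.1%}")
```

Under these assumptions one cluster is a bit under 2% of production, and the big buyers together land in the neighborhood of the ~12% figure whether or not Tesla is counted.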
Probably some scientific modelling that can be done on these, so I bet some universities and private labs will be buying them. NASA, SpaceX, RocketLab, Helion, etc.
There's also probably a lot of AAA game studios and art studios for movies etc. who are each buying dozens of these graphics processing units for… graphics :P
What happened to their custom hardware training stack Dojo? They had some interesting ideas there. The last I heard, they had one of those tiles "working" in the lab. Pretty far from a production setup.
I can imagine they either underestimated the software effort needed to squeeze as much performance as possible out of those things, or they underestimated the pace at which Nvidia scales FLOPS/$, or both.
When we've dealt with the oil companies, the chemical manufacturers dumping PFAS into our kids, and the industrial war machine, maybe then we can start complaining about the guy biting off more than he can chew trying to be constructive.
Until then, all of you sound vicious, bitter, and hypocritical.
We are perfectly capable of having issues with all of those as well. I can still ask for a speedbump on my street while also voicing concern with military posturing… Crazy I know! Musk passed ‘guy biting off more than he can chew’ when he started accusing heroes of being pedophiles.
John Carmack on a shoestring budget nearly got this working at Armadillo. If he had more money and time he would've had it working a half decade before SpaceX.
Most of those things have happened actually, but the website makes it seem that they didn't. It just lists everything Elon has said, but doesn't track if they happened or not. This is a completely pointless website.
That Tesla owners can use their cars to make money while they are working as robo taxis - let's just say he vastly underestimates the effort it takes to make progress - FSD is not there yet.
They will get there at some acceptable point but not with the tech in current Teslas - the current compute module will need to be replaced - I think they showed off HW 4 in lieu of HW 3.
Sucks to be you if you paid the early bird fee for it.
Hmm, that website could be really interesting if it didn’t so clearly try to be misleading, e.g. a statement about something being in development but not yet out isn’t a promise that it will be out now. Really silly.
Also if it didn’t exclude everything that has been delivered. That long list would be interesting to see as well.
> This AI cluster, worth more than $300 million, will offer a peak performance of 340 FP64 PFLOPS for technical computing and 39.58 INT8 ExaFLOPS for AI applications, according to Tom’s Hardware.
I was curious why this statement led with fp64 flops (instead of fp32, perhaps), but I looked up the H100 specs, and NV’s marketing page does the same thing. They’re obviously talking about the H100 SXM here, which has the same peak theoretical fp64 throughput as fp32. The cluster perf is estimated by multiplying the GPU perf by 10k.
Also, obviously, int8 tensor ops aren’t ‘FLOPS’. I think Nvidia calls them “TOPS” (tensor ops). There is a separate metric for ‘tensor flops’ or TF32.
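To make the "multiply by 10k" estimate concrete, here is a small sketch that works backwards from the quoted cluster totals to the per-GPU peaks they imply. The per-GPU values are derived from the article's numbers rather than copied from a datasheet, and the variable names are only illustrative.

```python
GPUS = 10_000

# Cluster-level peak figures as quoted in the article.
CLUSTER_FP64_PFLOPS = 340.0
CLUSTER_INT8_EXAOPS = 39.58

# Unit conversions: 1 peta = 1,000 tera; 1 exa = 1,000,000 tera.
per_gpu_fp64_tflops = CLUSTER_FP64_PFLOPS * 1_000 / GPUS
per_gpu_int8_tops = CLUSTER_INT8_EXAOPS * 1_000_000 / GPUS

print(f"implied per-GPU fp64 peak: {per_gpu_fp64_tflops:.0f} TFLOPS")
print(f"implied per-GPU int8 peak: {per_gpu_int8_tops:.0f} TOPS")
```

These are theoretical peaks scaled linearly, so they say nothing about what the cluster sustains on a real workload once interconnect and memory limits come into play.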
In the old days, depending on architecture, fp64 performance could be atrocious even when fp32 was decent, so bragging about fp64 performance has an authenticity to it. Not all scientific computing requires 64 bits, but knowing that you can drop to high precision when necessary without penalty is nice.
Also, back in the day, integer ops were just called 'ops', grumble grumble. But yeah FLOPS specifically refers to floating point. Calling them TOPS doesn't make sense to me, since tensor cores were meant for matrix operation speedup, and these matrices are rarely integer.
Still true that fp64 throughput is lower for consumer GPUs - both NV and AMD. That’s kinda why I was curious about leading with that metric - outside of HPC and scientific applications, a lot of people don’t really need fp64, and the machine might normally have a much higher fp32 throughput.
> knowing you can drop to high precision when necessary without penalty is nice.
I guess I maybe don’t know why you’d ever have 1:1 fp32 and fp64 perf. Aren’t the fp64 multipliers (for example) basically 4x fp32 multipliers? I am under the possibly naive impression that if you have all the transistors for 1 fp64 core, that you’d end up with all the transistors you need for 2 or 4 fp32 cores. Maybe that’s not true today, but there does have to be at least 2x the transistors overall for 64-bit vs 32-bit, and lots of those should be shared or reusable, no? It doesn’t seem quite right to frame naturally higher 32-bit op throughput as a “penalty” on 64-bit ops. You’re asking the hardware to do more with 64, and it makes complete sense that given the exact same budget for bandwidth, energy, memory, compute, etc. that 32-bit ops would go faster, no? If the op throughput of fp64 and fp32 is the same, doesn’t that possibly imply that the fp32 ops are potentially being wasted / penalized, just for the sake of having matching numbers?
This is also related to "fast" versions of some operations. You might want the full 32 bit float but you don't want or need to do full precision division or sqrt operations. This is common in games/graphics and probably machine learning.
You're right -- I have no idea why fp64 wouldn't be half the speed of fp32, and traditionally it is. I was simply taking them at their word. Maybe they're exaggerating or maybe they did what you suggest and hamstrung fp32.
n00b questions from someone just beginning to get interested in HPC
I see mention of using this supercomputer for training models. Is that the only purpose? What other types of things do orgs usually do with these supercomputers?
Are there any good boots-on-the-ground technical blogs that provide interesting detail on day-to-day experiences with these things?
As opposed to keeping all of your servers independent of each other, supercomputers are used any time you want to pretend the entire cluster is one computer.
In other words, they're used when you want to share some kind of state across all of the computers, without the potential overhead of communicating to some other system like a database.
Physics simulations and like, molecular modeling come to mind as common examples.
In the case of ML training, the model parameters, and the deltas that get calculated and broadcast during training, are that shared state.
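As a minimal sketch of that shared state, here is data-parallel training reduced to plain Python lists: each worker computes its own gradient, the gradients are averaged across workers (the job an all-reduce does on a real cluster), and every replica applies the same update. The function names and numbers are hypothetical, and real frameworks do this over the network with overlap and sharding.

```python
def all_reduce_mean(worker_grads):
    """Average each parameter's gradient across all workers."""
    n = len(worker_grads)
    return [sum(per_param) / n for per_param in zip(*worker_grads)]


def sgd_step(params, grad, lr):
    """Apply one plain SGD update to a list of parameters."""
    return [p - lr * g for p, g in zip(params, grad)]


# Every worker starts from the same parameters (the shared state).
params = [1.0, -2.0]

# Each worker computes a different gradient from its own slice of the data.
worker_grads = [[0.5, 1.0], [1.5, -1.0], [1.0, 0.0], [1.0, 4.0]]

# After the all-reduce, every worker holds the same averaged gradient...
avg = all_reduce_mean(worker_grads)

# ...so every replica ends up with identical parameters after the update.
replicas = [sgd_step(params, avg, lr=0.5) for _ in worker_grads]
print(avg, replicas[0])
```

The averaging step is where the fast interconnect matters: it runs every training step and every worker has to wait for it.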
Newbie question, could this cluster easily calculate the largest prime number? I've found that the largest known prime number was found back in 2018, which is a while back considering how compute has evolved.
Finding the largest prime is more a contest of who's willing to commit the most ridiculous amount of compute to the goal than it is a mathematical obstacle.
The cost of finding the next prime is likely into the millions now.
It's funny - I'm listening to "The Founders" audiobook and right now they're telling the story of Elon Musk at PayPal wanting to rewrite for Windows server because Linux was too hard.
Weird to think that his next company's compute platform is this.
Oh that simple, huh? Too bad you weren’t around in the late 90s to explain this to him, and to help him find the extremely rare group of folks familiar with Linux…
“Wanting to rewrite for windows” is what you said; was that not accurate?
The key thing here, however, is that Elon didn’t want whatever he asked for in a vacuum, despite what the book says. Surely this was his engineers’ preference.
It’s a bottleneck on Nvidia's side. They are producing less than Tesla consumes. Tesla's compute power will outclass many cloud providers combined in just three or four years with their own custom chip.
> It’s a bottleneck on Nvidia's side. They are producing less than Tesla consumes. Tesla's compute power will outclass many cloud providers combined in just three or four years with their own custom chip.
That seems like a bold claim. Google, Microsoft and Meta make so much more money than Tesla that if making AI chips was so easy, then they could clearly out-design and out-build Tesla without thinking too hard about it.
What makes you think that Tesla, a company with far fewer AI workers, far less knowledge, and far less money than the above companies, can out-design and out-build them?
> What makes you think that Tesla, a company with far fewer AI workers, far less knowledge, and far less money than the above companies, can out-design and out-build them?
Presumably because Elon himself will be involved in the design, and Elon, as we all know, is one of the world's great thinkers. ;)
Do you mind explaining what makes him good at it? Pay? Atmosphere? Management style?
From what I heard about SpaceX it seems to be a place grads go to burn out while being paid below market rate simply because they're excited about the idea. Maybe that impression is wrong, so I'd like to hear other perspectives.
The dirty little open secret with a lot of these platforms is that the contract sizes, hardware costs, etc. are so massive they come with multiple teams of dedicated engineers and internal expertise to get your application(s) up and running on them. Obviously these things are never quite "pull a docker container and run" and no one dropping eight or nine figures on these installs is going to do it without serious vendor backing and support.
It's part of the reason why AMD has had quite a bit of success here but is in single digit market share for "AI" otherwise.
Most people - even large orgs with thousands of GPUs - are so trapped in CUDA that the theoretical on-paper performance and cost benefits evaporate as soon as you spend all of your time porting everything over just to reach equivalent performance and functionality.
You’re being pedantic (rightfully so) and I’m being loose with words. While Dojo is the supercomputer Tesla built for vision training, I lumped anything contributing to their machine vision model training as Dojo. It’s called Dojo because that’s where the training takes place.
From the History section (although Technical Architecture is also worth reading in its entirety):
> In August 2023, Tesla powered on Dojo for production use as well as a new training cluster configured with 10,000 Nvidia H100 GPUs.
I’ll take the L wrt being flippant if we’re using words very specifically in this context, that’s fair. It’s great to see Tesla expand its training resources is my sentiment, regardless of how their aggregate ML compute is segmented.
Can you imagine how much power 10,000 H100s actually produce in production? I bet you'd be able to run modern games on a cluster that large at a full 60 FPS.
Is the sole source for these articles this (https://twitter.com/SawyerMerritt/status/1696011140508045660) random Twitter post by some guy who runs an online clothing company?