Tesla turns on 10k-node Nvidia H100 Cluster | HN Mirror

	Tesla turns on 10k-node Nvidia H100 Cluster (techradar.com)
	80 points by Kevcmk 981 days ago

17 comments

md_ 981 days ago

I'm confused. The article from September 1 linked to here is strangely future-tense ("But the firm’s latest investment in 10,000 of the company’s H100 GPUs dwarfs the power of this supercomputer....This AI cluster, worth more than $300 million, will offer a peak performance...").

It links to a Tom's Hardware article (https://www.tomshardware.com/news/teslas-dollar300-million-a...) from August 28 that says "Tesla is about to flip the switch on its new AI cluster, featuring 10,000 Nvidia H100 compute GPUs" and says "Tesla is set to launch its highly-anticipated supercomputer on Monday..." (presumably the September 1 event).

So, like, does Tesla actually have 10k H100s? Or do they have an order for 10k H100s? Or an intention to buy 10k H100s?

Is the sole source for these articles this (https://twitter.com/SawyerMerritt/status/1696011140508045660) random Twitter post by some guy who runs an online clothing company?

I don't mean to snipe, but this article doesn't seem to rise to the extremely high editorial standards of such tech-press luminaries as "TechRadar" and "Hacker News".

xedeon 981 days ago

> high editorial standards of such tech-press luminaries as "TechRadar" and "Hacker News".

If you would’ve just scrolled a little bit on that Twitter post that you linked, you would’ve seen these:

https://x.com/sawyermerritt/status/1696012091964915744

https://x.com/tim_zaman/status/1695488119729238147

Also, just FYI. Sawyer posts most of the Tesla and SpaceX breaking news on Twitter before major outlets even write their articles.

For example, here’s one just 12mins ago as confirmed by Elon: https://x.com/sawyermerritt/status/1728092021628313777

A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.

https://x.com/sawyermerritt/status/1709019899442479162

dpkirchner 981 days ago

I think you only see the additional tweets you're talking about if you're for whatever reason actually signed in to Twitter.

md_ 981 days ago

> If you would’ve just scrolled a little bit on that Twitter post that you linked, you would’ve seen these:

I don't see those when I scroll. I see

"Buckle up everyone, the acceleration of progress is about to get nutty!"

and this is the end of the post?

Maybe I'm misusing this thing?

> https://x.com/tim_zaman/status/1695488119729238147

So another guy who claims to be a Tesla employee says (again, strangely future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.

But the use of future tense is a bit weird, right? And the lack of any followup?

> A “random Twitter post by some guy who runs an online clothing company” is definitely a wrong assumption.

I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)

==

More seriously:

https://www.hpcwire.com/2023/08/17/nvidia-h100-are-550000-gp... says Nvidia is producing 550k H100s in 2023. And there's obviously a significant lead-time requirement.

So, yes, I can sorta imagine Tesla pre-ordered 2% of global supply of H100s early in 2023 and was bragging about it at the end of August just 'cause.

But I can also imagine this is smoke and mirrors, and they have, like, a handful with the rest on backorder, and we haven't heard more about it 'cause Tesla doesn't have marketing people, it just has wahoos who post things on Twitter.

Either way, I guess?
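The 2% figure in this comment can be checked with quick arithmetic. A minimal sketch, using only the two numbers quoted in the thread (10,000 GPUs for Tesla and an estimated 550,000 H100s produced in 2023); the variable names are illustrative and neither figure is verified here:

```python
# Figures quoted in the thread; both are estimates, not confirmed numbers.
tesla_h100s = 10_000
estimated_2023_h100_production = 550_000

# Fraction of the estimated yearly output that 10k GPUs would represent.
share = tesla_h100s / estimated_2023_h100_production
print(f"Tesla share of estimated 2023 H100 output: {share:.1%}")
```

That works out to a little under 2%, which matches the rounded figure used above.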

xedeon 981 days ago

> Maybe I'm misusing this thing?

That seems to be the case here. ;)

> So another guy who claims to be a Tesla employee says (again, strangely future tense) that this is true? I mean, I am willing to believe--'cause he paid $20 for a blue check--that he probably is a Tesla employee.

Another case of misuse? Here’s a tip for you: when you see a company logo/icon on someone's Twitter/X profile, that means they are verified to be affiliated with that org.

“Accounts affiliated with the organization will receive an affiliate badge on their profile with the organization’s logo, and will be featured on the organization’s Twitter profile, indicating their affiliation.”

https://twitter.com/verified/status/1641596848921276417

Instead of inferring that Tim Zaman is a random Twitter user who paid $20 for a blue check, why not just Google his name? ;)

https://letmegooglethat.com/?q=Tim+Zaman

> I guess I'm old. Back in my day, "evidence" wasn't some random dude's online posts. But I know things have changed. ;)

I linked a video where CNBC was interviewing Sawyer but it seems that you didn’t even bother to check it.

This seems to be the problem today. People refuse to do the bare minimum (which is not even much) required for critical thinking. Instead of verifying information, people tend to uncritically repeat inaccurate assumptions, even when provided with additional information in good faith.

md_ 981 days ago

Sure. I’m being a bit snarky. But I think the point stands that a single tweet from an employee saying “we’re about to do $thing” doesn’t exactly mean that, two months later, we should be reading a story whose sole origin is that tweet as evidence that $thing actually ended up happening.

Like, what's the actual news story here?

Totally agree there’s a lack of critical thinking at play.

Also, I think the X.com links only work if you have a login or something, fyi.

chollida1 981 days ago

I understand that the H100 is NVidia's leading edge chip, but can someone let me know if 10K is considered to be a big cluster?

I've never worked inside one of the leading edge AI companies like OpenAI, Google, Microsoft or Meta.

Is this comparable to what they would work with?

My first guess is that it seems much smaller. And if you are running many parallel training jobs then you are getting about 1,000 chips at most to work with.

Or is this about what the leading competitors are working with?

Azure, for one, seems to have orders of magnitude more chips at their disposal.

jeffreyames 981 days ago

10k H100 chips is considered a very large cluster. The third fastest supercomputer in the world is Microsoft’s Eagle with 14k H100s https://www.top500.org/lists/top500/2023/11/

chollida1 981 days ago

Ah, gotcha, so it's the fact that it's 10,000 chips for one dedicated cluster that makes it large, as opposed to Azure, which has an order of magnitude more GPUs but rents many of those out.

jeffreyames 981 days ago

High performance on a single task requires simultaneous computation and communication between nodes. If there's high latency between nodes, such as between nodes in different data centers, the communication costs can't be masked by computation.
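The point about masking communication with computation can be illustrated with a toy timing model. This is a sketch under simplified assumptions (perfect overlap, one compute phase and one communication phase per step); the function name and the example timings are made up for illustration and are not measurements of any real cluster.

```python
def step_time(compute_s: float, comm_s: float, overlap: bool = True) -> float:
    """Toy model of one training step.

    With overlap, communication runs concurrently with computation, so the
    step takes as long as the slower of the two. Without overlap, they add.
    """
    return max(compute_s, comm_s) if overlap else compute_s + comm_s


# Hypothetical numbers: 100 ms of compute per step.
compute = 0.100
low_latency_comm = 0.060   # e.g. nodes in the same cluster
high_latency_comm = 0.400  # e.g. nodes in different data centers

print(step_time(compute, low_latency_comm))   # communication hidden behind compute
print(step_time(compute, high_latency_comm))  # communication dominates the step
```

In the first case the step costs no more than the compute alone; in the second, no amount of overlap hides the communication, which is the situation described above.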

rightbyte 981 days ago

I guess Azure's are spread out too. Latency is higher between worldwide datacentres.

latchkey 981 days ago

I previously ran 150,000 AMD GPUs. 10k doesn't seem that large. =)

That said, these GPUs aren't just the GPUs. They are whole chassis. They are huge onboard storage arrays, TBs of RAM, 800G networking (and associated cables), racks, cooling, power distribution, backup power, etc...

None of it is easy.

LysPJ 981 days ago

Out of interest, what did you use all that compute for?

latchkey 981 days ago

ETH PoW. When ETH switched to PoS, we shut it all down. It sure was fun while it lasted, not many people on the planet have run that much compute.

I did a lot of unique optimizations to autotune each individual GPU for performance by tweaking the software knobs on them. They are all snowflakes. Same model, different batches (heck, even same batch!), can produce wildly different performance results.

Over the years, I did try to find some alternative workloads for it, but nothing could even pay for the power costs. The GPUs were very old models (rx470-rx580) and the rest of the hardware wasn't that advanced, like it is in AI, so none of it was transferred.

I'm in the process of building my own AI supercomputer now. Really looking forward to seeing how it turns out.

alecco 980 days ago

Make a vid. Or a blog post, at least. Please :)

latchkey 980 days ago

Thanks, but not my style, sorry! I've been doing PoW mining since 2014 and have so many stories, I've forgotten half of them. I wouldn't even know where to start on trying to document any of it.

ushakov 981 days ago

Did you manage to recoup the investment?

latchkey 981 days ago

Of course I can't say anything about that other than I did the job I was hired to do, and I performed far above anyone's wildest expectations.

Nobody else on the planet was able to automate the tuning like I did, which had a direct influence on ROI. I know this because it required a very specific change to the AMD drivers to enable that functionality to happen.

peteradio 981 days ago

Classified I imagine.

_zoltan_ 981 days ago

H100 based DGX/HGX doesn't use 800 Gbit (it doesn't have the PCI-e bw), it's using 400 per GPU.

latchkey 981 days ago

I was talking about between nodes. We're planning on bonding 2x400G NICs to get that 800G between nodes.

That said, latest 4th gen nvlink is 900G...

https://www.nvidia.com/en-us/data-center/nvlink/

But unless you're sleeping with Jensen, you're not going to see it for 52 weeks+ after you order it.

_zoltan_ 977 days ago

between the GPUs you already have 3.2Tbit/s, plus the 2x400 separately. Pretty sweet.

Our lead time hasn't been horrible actually, but I work for a pretty big corp

latchkey 977 days ago

It is amazing to me how it is all about who you know. We just got a higher level contact and magically Nvidia NICs just appeared in our BOM.

joshhart 981 days ago

This is a big cluster, definitely large enough to pretrain 100B+ parameter LLMs in months. Source - I work at Databricks in the ML platform.

I don’t know much about AV processing, that’s highly customized to only a few customers but I’d expect it to also have very large computational requirements to do video processing and reinforcement learning.

kcb 981 days ago

The most powerful listed supercomputer has 37,888 Radeon GPUs, so in the same order of magnitude.

jbverschoor 981 days ago

Interesting choice of words... I take it you work for OpenAI? :) How large is their/'your' cluster? Probably the biggest in the world by now..

kkielhofner 981 days ago

Parent is almost certainly talking about Frontier, the supercomputer with the US Department of Energy[0].

[0] - https://top500.org/system/180047/

jbverschoor 981 days ago

Yes, that's "listed".. I'm curious how big the "unlisted" cluster is.

kcb 981 days ago

Unfortunately no, but there are almost certainly clusters in the hands of private companies and government organizations that would prefer not to advertise their capabilities.

ben_w 981 days ago

Last I heard, the estimate was that NVIDIA would build 550k units in 2023, so 2% of all production — especially as at least six others (your four plus Apple and at least one intelligence agency) will be of similar size by themselves — is certainly non-negligible.

2OEH8eoCRo0 981 days ago

550k H100s? Who is buying these? They are hella expensive and China isn't allowed to have them.

ben_w 981 days ago

Other than the ~12% I just estimated, lots of large-but-not-famous places will be buying ~1k, and small places will be buying tens to hundreds, and quite a lot of AI bubble money will be invested in startups that claim they only need one.

Probably some scientific modelling that can be done on these, so I bet some universities and private labs will be buying them. NASA, SpaceX, RocketLab, Helion, etc.

There's also probably a lot of AAA game studios and art studios for movies etc. who are each buying dozens of these graphics processing units for… graphics :P

alecco 980 days ago

Government agencies.

ushakov 981 days ago

The Big Cloud

xvilka 981 days ago

It's a small cluster the size of large cluster.

jdiez17 981 days ago

What happened to their custom hardware training stack Dojo? They had some interesting ideas there. The last I heard, they had one of those tiles "working" in the lab. Pretty far from a production setup.

I can imagine they either underestimated the software effort needed to squeeze as much performance as possible out of those things, or they underestimated the pace at which Nvidia scales FLOPS/$, or both.

vardump 981 days ago

Probably they want to have all and any compute they can have. This doesn't exclude Dojo nor the previous generation nvidia chips they already got.

martin8412 981 days ago

Vaporware, just like much of what Musk talks about.

s1gnp0st 981 days ago

Reusable rockets, electric cars, solar panels...

What would you say grants you the standing to opine here?

nickthegreek 981 days ago

All of his other false or misleading statements over the last 10 years.

s1gnp0st 980 days ago

When we've dealt with the oil companies, the chemical manufacturers dumping PFAS into our kids, and the industrial war machine, maybe then we can start complaining about the guy biting off more than he can chew trying to be constructive.

Until then, all of you sound vicious, bitter, and hypocritical.

nickthegreek 980 days ago

We are perfectly capable of having issues with all of those as well. I can still ask for a speedbump on my street while also voicing concern with military posturing… Crazy I know! Musk passed ‘guy biting off more than he can chew’ when he started accusing heroes of being pedophiles.

mmcwilliams 981 days ago

I'm fairly certain all of those existed prior to Musk's suggestion of them.

aik 981 days ago

He delivered on them though right? Also reusable rockets didn’t exist?

astrodust 981 days ago

John Carmack on a shoestring budget nearly got this working at Armadillo. If he had more money and time he would've had it working a half decade before SpaceX.

mmcwilliams 981 days ago

The McDonnell Douglas DC-X?

awestroke 981 days ago

You’re also wrong

mmcwilliams 981 days ago

Which one of those didn't?

martin8412 981 days ago

https://elonmusk.today/

Geee 980 days ago

Most of those things have happened actually, but the website makes it seem that they didn't. It just lists everything Elon has said, but doesn't track if they happened or not. This is a completely pointless website.

MarCylinder 981 days ago

Actually they're in the middle of production at TSMC. They have 10,000 units on order, to be delivered "in the coming year".

aik 981 days ago

What that he has talked about has been vaporware?

tibbydudeza 981 days ago

That Tesla owners can use their cars to make money while they are working as robo taxis - let's just say he vastly underestimates the effort it takes to make progress - FSD is not there yet.

aik 981 days ago

Vaporware assumes it will never happen. Is that the case you think or is it that he was vastly over optimistic? Very likely the latter.

voakbasda 981 days ago

FSD is currently in the quantum valley of product development: it is both vaporware and a shipping product.

nickthegreek 981 days ago

Vaporware makes no assumptions about the future. Everything is vaporware until it isn’t.

tibbydudeza 981 days ago

They will get there at some acceptable point but not with the tech in current Teslas - the current compute module will need to be replaced - I think they showed off HW 4 in lieu of HW 3.

Sucks to be you if you paid the early bird fee for it.

martin8412 981 days ago

https://elonmusk.today/

aik 981 days ago

Hmm that website could be really interesting if it clearly didn’t try being misleading, e.g. a statement about something being in development but not yet out isn’t a promise that it will be out now. Really silly.

Also if it didn’t exclude everything that has been delivered. That long list would be interesting to see as well.

jeffbee 981 days ago

Dojo has always been a lie.

aik 981 days ago

Source? The article mentions they now have / use both.

millerm 981 days ago

Your assertion is inaccurate.

alecco 981 days ago

Original tweet: https://twitter.com/SawyerMerritt/status/1696011140508045660

Previous article: https://www.tomshardware.com/news/teslas-dollar300-million-a...

This is second-hand blogspam.

TonyTrapp 981 days ago

Tom's Hardware and Tech Radar belong to the same company. If you consider this to be blog spam, almost any news website these days would be blog spam.

queuebert 981 days ago

> almost any news website these days would be blog spam

Yes.

alecco 981 days ago

Almost everything is in the original tweet.

einpoklum 981 days ago

And the original tweet is very much kool-aid heavy, with "20x performance", "30x performance" claims about the switch from one card to the next.

dahart 981 days ago

> This AI cluster, worth more than $300 million, will offer a peak performance of 340 FP64 PFLOPS for technical computing and 39.58 INT8 ExaFLOPS for AI applications, according to Tom’s Hardware.

I was curious why this statement led with fp64 flops (instead of fp32, perhaps), but I looked up the H100 specs, and NV’s marketing page does the same thing. They’re obviously talking about the H100 SXM here, which has the same peak theoretical fp64 throughput as fp32. The cluster perf is estimated by multiplying the GPU perf by 10k.

Also, obviously, int8 tensor ops aren’t ‘FLOPS’. I think Nvidia calls them “TOPS” (tensor ops). There is a separate metric for ‘tensor flops’ or TF32.
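The multiply-by-10k estimate described above can be run in reverse to recover the per-GPU figures the article implies. A small sketch using only the cluster numbers quoted from the article; the variable names are illustrative, and the per-GPU values are derived here, not taken from a spec sheet:

```python
gpus = 10_000

cluster_fp64_pflops = 340      # quoted: 340 FP64 PFLOPS
cluster_int8_exaops = 39.58    # quoted as "INT8 ExaFLOPS"; these are integer ops

# 1 PFLOPS = 1,000 TFLOPS; 1 exa-op/s = 1,000,000 tera-ops/s (TOPS)
per_gpu_fp64_tflops = cluster_fp64_pflops * 1_000 / gpus
per_gpu_int8_tops = cluster_int8_exaops * 1_000_000 / gpus

print(f"FP64 per GPU: {per_gpu_fp64_tflops:.0f} TFLOPS")
print(f"INT8 per GPU: {per_gpu_int8_tops:.0f} TOPS")
```

Both are peak theoretical numbers scaled linearly, so they say nothing about what the cluster sustains once interconnect and memory limits come into play.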

queuebert 981 days ago

In the old days, depending on architecture, fp64 performance could be atrocious even when fp32 was decent, so bragging about fp64 performance has an authenticity to it. Not all scientific computing requires 64 bits, but knowing that you can drop to high precision when necessary without penalty is nice.

Also, back in the day, integer ops were just called 'ops', grumble grumble. But yeah FLOPS specifically refers to floating point. Calling them TOPS doesn't make sense to me, since tensor cores were meant for matrix operation speedup, and these matrices are rarely integer.

dahart 981 days ago

Still true that fp64 throughput is lower for consumer GPUs - both NV and AMD. That’s kinda why I was curious about leading with that metric - outside of HPC and scientific applications, a lot of people don’t really need fp64, and the machine might normally have a much higher fp32 throughput.

> knowing you can drop to high precision when necessary without penalty is nice.

I guess I maybe don’t know why you’d ever have 1:1 fp32 and fp64 perf. Aren’t the fp64 multipliers (for example) basically 4x fp32 multipliers? I am under the possibly naive impression that if you have all the transistors for 1 fp64 core, that you’d end up with all the transistors you need for 2 or 4 fp32 cores. Maybe that’s not true today, but there does have to be at least 2x the transistors overall for 64-bit vs 32-bit, and lots of those should be shared or reusable, no? It doesn’t seem quite right to frame naturally higher 32-bit op throughput as a “penalty” on 64-bit ops. You’re asking the hardware to do more with 64, and it makes complete sense that given the exact same budget for bandwidth, energy, memory, compute, etc. that 32-bit ops would go faster, no? If the op throughput of fp64 and fp32 is the same, doesn’t that possibly imply that the fp32 ops are potentially being wasted / penalized, just for the sake of having matching numbers?

petermcneeley 981 days ago

This is also related to "fast" versions of some operations. You might want the full 32 bit float but you don't want or need to do full precision division or sqrt operations. This is common in games/graphics and probably machine learning.

queuebert 981 days ago

You're right -- I have no idea why fp64 wouldn't be half the speed of fp32, and traditionally it is. I was simply taking them at their word. Maybe they're exaggerating or maybe they did what you suggest and hamstrung fp32.

petermcneeley 981 days ago

Nit: INT8 is not a floating point operation and thus cannot be used in the term "ExaFLOPS"

throwaway4good 981 days ago

I predict it will run for 5 years and then come up with the answer: FSD needs lidar.

kaycebasques 981 days ago

n00b questions from someone just beginning to get interested in HPC

I see mention of using this supercomputer for training models. Is that the only purpose? What other types of things do orgs usually do with these supercomputers?

Are there any good boots-on-the-ground technical blogs that provide interesting detail on day-to-day experiences with these things?

abatilo 981 days ago

As opposed to keeping all of your servers independent of each other, supercomputers are used any time you want to pretend the entire cluster is one computer.

In other words, they're used when you want to share some kind of state across all of the computers, without the potential overhead of communicating to some other system like a database.

Physics simulations and, like, molecular modeling come to mind as common examples.

In the case of ML training, the model parameters, and the deltas that get calculated and broadcast during training, are that shared state.
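The shared state described here can be sketched as a toy data-parallel training step: each worker holds a copy of the parameters, computes a local delta, and the deltas are averaged so every copy stays identical. This is a single-process illustration in plain Python, not how any real cluster or framework implements it; `all_reduce_mean`, `sgd_step`, the worker count, and the gradient values are all hypothetical.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (stand-in for a real all-reduce)."""
    n_workers = len(per_worker_grads)
    return [sum(vals) / n_workers for vals in zip(*per_worker_grads)]


def sgd_step(params, grad, lr):
    """Apply one gradient-descent update to a parameter list."""
    return [p - lr * g for p, g in zip(params, grad)]


# Every worker starts from the same parameters (the shared state).
params = [1.0, -2.0]

# Hypothetical local gradients, one list per worker, from different data shards.
local_grads = [
    [0.5, 1.0],
    [1.5, 3.0],
    [1.0, 2.0],
    [1.0, 2.0],
]

avg_grad = all_reduce_mean(local_grads)      # [1.0, 2.0]
params = sgd_step(params, avg_grad, lr=0.5)  # same result on every worker
print(avg_grad, params)
```

On real hardware the averaging step is the part that crosses the network on every iteration, which is why interconnect bandwidth and latency matter so much at this scale.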

dsco 981 days ago

Newbie question, could this cluster easily calculate the largest prime number? I've found that the largest known prime number was found back in 2018, which is a while back considering how compute has evolved.

astrodust 981 days ago

Finding the largest prime is more a contest of who's willing to commit the most ridiculous amount of compute to the goal than it is a mathematical obstacle.

The cost of finding the next prime is likely into the millions now.

cactusplant7374 981 days ago

Is FSD really a hardware problem for them?

amai 981 days ago

Do they also order a power plant for that cluster? Or how much energy does such a thing need?

bluelightning2k 981 days ago

It's funny - I'm listening to "The Founders" audiobook and right now they're telling the story of Elon Musk at PayPal wanting to rewrite for Windows server because Linux was too hard.

Weird to think that his next company's compute platform is this.

WendyTheWillow 981 days ago

Linux was a lot harder back then.

cactusplant7374 981 days ago

Harder for who? Elon certainly didn't have the technical chops to work with it.

WendyTheWillow 981 days ago

Harder for everyone, including his staff, who were asking him to move to Windows…

huggingmouth 981 days ago

He should have hired staff that is competent with the tech stack used at his company.

Unforced rewrites are usually a bad idea.

WendyTheWillow 981 days ago

Oh that simple, huh? Too bad you weren’t around in the late 90s to explain this to him, and to help him find the extremely rare group of folks familiar with Linux…

bluelightning2k 981 days ago

Actually he was the other way around. He was strongly pushing Windows, and the CTO and engineers from Confinity strongly wanted Linux.

WendyTheWillow 981 days ago

“Wanting to rewrite for windows” is what you said; was that not accurate?

The key thing here however is that Elon didn’t want whatever he asked for in a vacuum, despite what the book says. Surely this was his engineers’ preference.

jbverschoor 981 days ago

So THAT's why my power blipped

7e 981 days ago

Only 10K?

iamgopal 981 days ago

It’s a bottleneck on Nvidia's side. They are producing less than Tesla consumes. Tesla's compute power will outclass many cloud providers combined in just three or four years with their own custom chip.

chollida1 981 days ago

> It’s a bottleneck on Nvidia's side. They are producing less than Tesla consumes. Tesla's compute power will outclass many cloud providers combined in just three or four years with their own custom chip.

That seems like a bold claim. Google, Microsoft and Meta make so much more money than Tesla that if making AI chips was so easy, then they could clearly out-design and out-build Tesla without thinking too hard about it.

What makes you think that Tesla, a company with far fewer AI workers, far less knowledge, and far less money than the above companies, can out-design and out-build them?

md_ 981 days ago

> What makes you think that Tesla, a company with far fewer AI workers, far less knowledge, and far less money than the above companies, can out-design and out-build them?

Presumably because Elon himself will be involved in the design, and Elon, as we all know, is one of the world's great thinkers. ;)

Jensson 981 days ago

Elon is one of the world's greatest talent poachers, and that is much better than being a great thinker.

md_ 981 days ago

Is he, though?

I recently spoke with someone who quit SpaceX because (among other reasons) they felt Elon was a meddling micro-manager. That's just one anecdote, of course, but the Internet is full of them, replete with summary firings (https://www.businessinsider.com/tesla-elon-musk-ruthlessly-f...), worker safety issues (https://www.washingtonpost.com/technology/2021/03/12/hundred...), and just general bullshit (https://www.reddit.com/r/EnoughMuskSpam/comments/9e360m/elon...).

I don't deny that his public image, for years, was an overall positive one. I really enjoyed Jill Lepore's digging into it here: https://www.pushkin.fm/podcasts/elon-musk-the-evening-rocket.

But it seems like people who worked with him knew, for a long time, that he was full of shit. And increasingly, the public seems to as well.

solarkraft 981 days ago

Do you mind explaining what makes him good at it? Pay? Atmosphere? Management style?

From what I heard about SpaceX it seems to be a place grads go to burn out while being paid below market rate simply because they're excited about the idea. Maybe that impression is wrong, so I'd like to hear other perspectives.

chollida1 981 days ago

Interesting, what makes you think this?

kkielhofner 981 days ago

The dirty little open secret with a lot of these platforms is the contract sizes, hardware costs, etc are so massive they come with multiple teams of dedicated engineers and internal expertise to get your application(s) up and running on them. Obviously these things are never quite "pull a docker container and run" and no one dropping eight to nine figures on these installs is going to do it without serious vendor backing and support.

It's part of the reason why AMD has had quite a bit of success here but is in single digit market share for "AI" otherwise.

Most people - even large orgs with thousands of GPUs - are so trapped in CUDA the theoretical on paper performance and cost benefits evaporate immediately when you spend all of your time trying to port everything over to the point you get equivalent performance and functionality.

candiddevmike 981 days ago

Got a source for that?

ComputerGuru 981 days ago

The original tweet makes the claim, but the tweet seems prone to hyperbole as well.

https://twitter.com/SawyerMerritt/status/1696011140508045660

mousetree 981 days ago

The original tweet quotes Elon Musk saying "Frankly...if they (NVIDIA) could deliver us enough GPUs, we might not need Dojo"

mousetree 981 days ago

$300 million for those 10,000

latchkey 981 days ago

much. much. more. You're not factoring in the disks, chassis, ram, networking gear, cabling, data center build, setup, install, etc etc etc...

kcb 981 days ago

> The firm also built a compute cluster fitted with 5,760 Nvidia A100 GPUs in June 2012

Wow, that's some really early hardware access. /s

Geee 981 days ago

Lol, I was wondering if A100 is really that old. Turns out A100 was released in 2020.

kcb 981 days ago

Yea I assume they meant 2021. 2012 was still the early days of GPU compute. Best we had were M2090s.

blackoil 981 days ago

Maybe they picked the date when Elon first communicated that they are "ready" to go live. Like everything else it took a decade to materialize.

visarga 981 days ago

I only had about 3 NVIDIA H100 in 1980

kcb 981 days ago

Someone needs to figure out at what point all the compute in the world became more powerful than a single H100.

toomuchtodo 981 days ago

The Dojo is open.

kranke155 981 days ago

I thought Dojo was custom chips.

ComputerGuru 981 days ago

You are correct; it is, and flippant HN comments that are additionally incorrect are starting to become a thing. See the original tweet: https://twitter.com/SawyerMerritt/status/1696011140508045660

toomuchtodo 981 days ago

You’re being pedantic (rightfully so) and I’m being loose with words. While Dojo is the supercomputer Tesla built for vision training, I lumped anything contributing to their machine vision model training as Dojo. It’s called Dojo because that’s where the training takes place.

https://en.wikipedia.org/wiki/Tesla_Dojo

From the History section (although Technical Architecture is also worthy of consuming in its entirety):

> In August 2023, Tesla powered on Dojo for production use as well as a new training cluster configured with 10,000 Nvidia H100 GPUs.

I’ll take the L wrt being flippant if we’re using words very specifically in this context, that’s fair. It’s great to see Tesla expand its training resources is my sentiment, regardless of how their aggregate ML compute is segmented.

ra7 981 days ago

You’re just using it wrong. Dojo “supercomputer” specifically includes custom chips, which don’t exist yet.

tomaytotomato 981 days ago

Pain does not exist in this Dojo. Kiai!

andrewmcwatters 981 days ago

Can you imagine how much power 10,000 H100s actually produce in production? I bet you'd be able to run modern games on a cluster that large at a full 60 FPS.

Kevcmk 981 days ago

Nvidia is powering a mega Tesla supercomputer powered by 10,000 H100 GPUs

ComputerGuru 981 days ago

Did you just repeat the headline?