	by neilmovva 1183 days ago
	A bit underwhelming - H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find, and I haven't yet seen ML researchers reporting any use of H100. The new "NVL" variant adds ~20% more memory per GPU by enabling the sixth HBM stack (previously only five out of six were used). Additionally, GPUs now come in pairs with 600GB/s bandwidth between the paired devices. However, the pair then uses PCIe as the sole interface to the rest of the system. This topology is an interesting hybrid of the previous DGX (put all GPUs onto a unified NVLink graph), and the more traditional PCIe accelerator cards (star topology of PCIe links, host CPU is the root node). Probably not an issue; I think PCIe 5.0 x16 is already fast enough to not bottleneck multi-GPU training too much.
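
A rough back-of-the-envelope check of the numbers in the post above, as a minimal sketch rather than a benchmark. Assumptions: every HBM stack has equal capacity, PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding, and the 600GB/s figure is taken from the post as-is (it does not say whether that is per-direction or aggregate, so the ratio is only indicative). The helper name `pcie_bandwidth_gb_per_s` is hypothetical.

```python
def pcie_bandwidth_gb_per_s(gt_per_s: float, lanes: int,
                            encoding: float = 128 / 130) -> float:
    """Approximate usable PCIe bandwidth in GB/s, one direction.

    gt_per_s: raw transfer rate per lane (32 for PCIe 5.0).
    encoding: line-code efficiency (128b/130b for PCIe 3.0 and later).
    Protocol overhead (TLP headers, flow control) is ignored.
    """
    return gt_per_s * lanes * encoding / 8  # bits -> bytes


# Enabling the sixth of six HBM stacks, assuming equal capacity per stack.
hbm_gain = 6 / 5 - 1

# Host link for the GPU pair: PCIe 5.0 x16, one direction.
pcie5_x16 = pcie_bandwidth_gb_per_s(32, 16)

# How much faster the pair-to-pair link is than the host link,
# using the 600GB/s figure quoted in the post.
bridge_vs_pcie = 600 / pcie5_x16

print(f"HBM capacity gain:   {hbm_gain:.0%}")
print(f"PCIe 5.0 x16:        {pcie5_x16:.1f} GB/s per direction")
print(f"Pair link / PCIe:    {bridge_vs_pcie:.1f}x")
```

Under these assumptions the sixth stack accounts for the "~20%" figure, and the paired link is roughly an order of magnitude faster than the PCIe link to the host, which is why traffic that stays inside a pair is cheap compared with traffic that leaves it.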

4 comments

binarymax 1183 days ago

It is interesting that Hopper isn’t widely available yet.

I have seen some benchmarks from academia but nothing in the private sector.

I wonder if they thought they were moving too fast and wanted to milk Ampere/Ada as long as possible.

Not having any competition whatsoever means Nvidia can release what they like when they like.

pixl97 1183 days ago

The question is, do they not have much production, or are OpenAI and Microsoft buying every single one they produce?

TylerE 1183 days ago

Why bother when you can get cryptobros paying way over MSRP for 3090s?

andy81 1183 days ago

GPU mining died last year.

There's so little liquidity post-merge that it's only worth mining as a way to launder stolen electricity.

The bitcoin people still waste raw materials, and prices are relatively sticky with so few suppliers and a backlog of demand, but we've already seen prices drop heavily since then.

TylerE 1183 days ago

Right, that's why Nvidia is actually trying again. The money printer has run out of ink.

binarymax 1183 days ago

Not just cryptobros. A100s are the current top of the line and it’s hard to find them available on AWS and Lambda. Vast.AI has plenty if you trust renting from a stranger.

AMD really needs to pick up the pace and make a solid competitive offering in deep learning. They’re slowly getting there but they are at least 2 generations out.

fbdab103 1183 days ago

I would take a huge performance hit to just not deal with Nvidia drivers. Unless things have changed, it is still not really possible to operate on AMD hardware without a list of gotchas.

brucethemoose2 1183 days ago

It's still basically impossible to find MI200s in the cloud.

On desktops, only the 7000 series is kinda competitive for AI in particular, and you have to go out of your way to get it running quick in PyTorch. The 6000 and 5000 series just weren't designed for AI.

breatheoften 1183 days ago

It's crazy to me that no other hardware company has sought to compete for the deep learning training/inference market yet ...

The existing ecosystems (CUDA, PyTorch etc) are all pretty garbage anyway -- aside from the massive number of tutorials it doesn't seem like it would actually be hard to build a vertically integrated competitor ecosystem ... it feels a little like the rise of Rails to me -- is a million articles about how to build a blog engine really that deep a moat ..?

KeplerBoy 1183 days ago

How could their moat possibly be deeper?

First of all you need hardware with cutting-edge chips. Chips which can only be supplied by TSMC and Samsung.

Then you need the software, ranging all the way from the firmware and driver, through something analogous to CUDA with libraries like cuDNN, cuBLAS and many others, to integrations into PyTorch and TensorFlow.

And none of that will come for free, like it came to Nvidia. Nvidia built CUDA and people built their DL frameworks around it in the last decade, but nobody will invest their time into doing the same for a competitor, when they could just do their research on Nvidia hardware instead.

Realistically it's up to AMD or Intel.

rcme 1183 days ago

There will probably be Chinese options as well. China has an incentive to provide a domestic competitor due to deteriorating relations with the U.S.

runnerup 1183 days ago

No other company has sought this?

https://www.cerebras.net/ has innovative technology, has actual customers, and is gaining a foothold in software-system stacks by integrating their platform into the OpenXLA GPU compiler.

wmf 1183 days ago

There are tons of companies trying; they just aren't succeeding.

__anon-2023__ 1183 days ago

Yes, I was expecting a RAM-doubled edition of the H100; this is just a higher-binned version of the same part.

I got an email from Vultr, saying that they're "officially taking reservations for the NVIDIA HGX H100", so I guess all public clouds are going to get those soon.

rerx 1183 days ago

You can also join a pair of regular PCIe H100 GPUs with an NVLink bridge. So that topology is not so new either.

ksec 1183 days ago

>H100 was announced at GTC 2022, and represented a huge stride over A100. But a year later, H100 is still not generally available at any public cloud I can find

You can safely assume an entity bought as many as they could.
