by ABS 833 days ago
H100 was released almost exactly 1 year ago so I guess it's ok if Intel is now ready to compete with last year's model.

To those commenting about "no moat": remember that CUDA is a huge part of it. The moat is actually HW+SW, and both took a decade to mature together.


It took less than a day to port our code over, and we do custom CUDA across modalities.

Gaudi2 was actually announced 2 years ago and is 7nm, like the A100 80GB it was meant to be competitive with. Gaudi3, later this year, is probably going to be the inflection point as that ramps.

The cost is like 1/3

https://www.intel.com/content/www/us/en/newsroom/news/vision...

"Announced" 2 years ago is different from its availability and ability:

- Intel acquired Habana in 2019

- Habana launched Gaudi2 in 2022

- Only in H2 2023 did Habana enable FP8, which delivered around a 100% improvement in time-to-train

On the rest I believe you, but markets don't move based on a single individual's or company's data points.

Gaudi2s started coming out in 2022 (https://huggingface.co/blog/habana-gaudi-2-benchmark) but didn't hit mass scale. I think Gaudi3 will, and others have seen similar performance for Gaudi2, e.g. Databricks: https://www.databricks.com/blog/llm-training-and-inference-i..., MLPerf, etc.

We are about to drop Stable Diffusion 3, which is the best image model out there (https://x.com/EMostaque/status/1764941367682256950?s=20), with a similar architecture to Sora by OpenAI that can be used for any modality.

We have hundreds of millions of downloads of our models, so we are looking for big scale as we move to every pixel being generated and this stuff goes from research to mass deployment.

  2024:  Nvidia B100   TSMC 3nm (?)
  2024:  Intel Gaudi3  TSMC 5nm (*)
  2023:  AMD MI300X    TSMC 5nm/6nm
  2022:  Nvidia H100   TSMC 4N
  2020:  Nvidia A100   TSMC 7nm

(*): Performance-critical chiplets, at least.

Falcon Shores next year will be crazy, with 300GB VRAM and new lithography.

The fact that AMD's GPGPU platform is buggy for consumers has more to do with incompetence and product cannibalisation than with the difficulty of building properly working drivers. Machine learning uses profoundly simple operations. Building a PyTorch backend isn't difficult if the drivers are working properly.