by rx_tx 506 days ago
This article has good background, context, and explanations [1]. They skipped CUDA and instead used PTX, a lower-level instruction set, which let them implement more performant cross-chip comms to make up for the less-performant H800 chips.

[1]: https://stratechery.com/2025/deepseek-faq/

2 comments

> Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA.

You can do this just fine in CUDA, no PTX required. Of course all the major shops are using inline PTX at the very least to access the Tensor cores effectively.
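To make that concrete, here is a minimal sketch of splitting a kernel's blocks into a "communication" role and a "compute" role in plain CUDA. It is not DeepSeek's code, and everything in it is an assumption made for illustration: the names `specialized`, `kCommBlocks`, and `current_sm_id` are hypothetical, a plain copy stands in for the cross-GPU transfer, and launching one block per SM does not guarantee a one-to-one block-to-SM mapping. The one inline PTX line is only there to report which SM each block landed on; the role split itself needs no PTX.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical split: the first kCommBlocks blocks move data, the rest compute.
constexpr int kCommBlocks = 2;
constexpr int kThreads = 128;

// Inline PTX: read the %smid special register (the SM this thread runs on).
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void specialized(const float* src, float* staged, float* data,
                            int n, unsigned int* block_sm) {
    if (threadIdx.x == 0) block_sm[blockIdx.x] = current_sm_id();

    if (blockIdx.x < kCommBlocks) {
        // "Communication" role: a copy stands in for a cross-GPU transfer.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += kCommBlocks * blockDim.x)
            staged[i] = src[i];
    } else {
        // "Compute" role: stride loop over the remaining blocks only.
        int workers = gridDim.x - kCommBlocks;
        int worker = blockIdx.x - kCommBlocks;
        for (int i = worker * blockDim.x + threadIdx.x; i < n;
             i += workers * blockDim.x)
            data[i] = data[i] * 2.0f + 1.0f;
    }
}

int main() {
    int num_sms = 0;
    if (cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, 0)
            != cudaSuccess) {
        std::printf("no CUDA device available\n");
        return 1;
    }
    // One block per SM, but always at least one compute block.
    int blocks = num_sms > kCommBlocks ? num_sms : kCommBlocks + 1;

    const int n = 1 << 20;
    float *src, *staged, *data;
    unsigned int* block_sm;
    cudaMallocManaged(&src, n * sizeof(float));
    cudaMallocManaged(&staged, n * sizeof(float));
    cudaMallocManaged(&data, n * sizeof(float));
    cudaMallocManaged(&block_sm, blocks * sizeof(unsigned int));
    for (int i = 0; i < n; ++i) {
        src[i] = static_cast<float>(i);
        staged[i] = 0.0f;
        data[i] = 1.0f;
    }

    specialized<<<blocks, kThreads>>>(src, staged, data, n, block_sm);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int b = 0; b < blocks; ++b)
        std::printf("block %d (%s) ran on SM %u\n", b,
                    b < kCommBlocks ? "comm" : "compute", block_sm[b]);
    std::printf("staged[42] = %.1f, data[42] = %.1f\n", staged[42], data[42]);

    cudaFree(src); cudaFree(staged); cudaFree(data); cudaFree(block_sm);
    return 0;
}
```

Branching on `blockIdx.x` keeps the role decision uniform within a block, so there is no intra-warp divergence. A real implementation would presumably replace the copy with actual interconnect traffic and synchronize the two roles, which this sketch does not attempt.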

So can people do the same in SPIR for OpenCL or amdgcn?

https://en.wikipedia.org/wiki/Standard_Portable_Intermediate...

https://www.khronos.org/spir/

Or, even better, in a unified language like SYCL?

https://cdrdv2-public.intel.com/786536/Heidelberg_IWOCL__SYC...