| HN Mirror



	by rx_tx 506 days ago
	This article has good background, context, and explanations [1]. DeepSeek skipped CUDA and instead used PTX, a lower-level instruction set, in which they were able to implement more performant cross-chip comms to make up for the less performant H800 chips. [1]: https://stratechery.com/2025/deepseek-faq/

2 comments

saagarjha 506 days ago

> Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA.

You can do this just fine in CUDA, no PTX required. Of course all the major shops are using inline PTX at the very least to access the Tensor cores effectively.
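
For illustration, here is a minimal sketch of role-splitting thread blocks in plain CUDA, with no PTX. It is not DeepSeek's code: the kernel name, the helper `launch_role_split`, and the counters are all hypothetical, and the "work" is just a tally per role. It also assumes the common persistent-kernel convention of launching one block per SM; CUDA itself does not guarantee which SM a block lands on.

```cuda
#include <cuda_runtime.h>

// Each block picks a role from its index: the first comm_blocks blocks stand in
// for communication work, the remaining blocks for compute work.
// counts[0] tallies communication blocks, counts[1] tallies compute blocks.
__global__ void role_split_kernel(int comm_blocks, int *counts) {
    if (threadIdx.x != 0) return;      // one tally per block is enough
    if (blockIdx.x < comm_blocks) {
        atomicAdd(&counts[0], 1);      // placeholder for cross-chip communication
    } else {
        atomicAdd(&counts[1], 1);      // placeholder for the actual compute
    }
}

// Hypothetical host helper: launches num_blocks blocks, copies the two tallies
// into host_counts, and returns the first CUDA error hit (or cudaSuccess).
cudaError_t launch_role_split(int num_blocks, int comm_blocks, int host_counts[2]) {
    int *dev_counts = nullptr;
    cudaError_t err = cudaMalloc(&dev_counts, 2 * sizeof(int));
    if (err != cudaSuccess) return err;

    err = cudaMemset(dev_counts, 0, 2 * sizeof(int));
    if (err == cudaSuccess) {
        role_split_kernel<<<num_blocks, 32>>>(comm_blocks, dev_counts);
        err = cudaGetLastError();      // catches launch-configuration errors
    }
    if (err == cudaSuccess) {
        // cudaMemcpy on the default stream waits for the kernel to finish.
        err = cudaMemcpy(host_counts, dev_counts, 2 * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_counts);
    return err;
}
```

Calling `launch_role_split(132, 20, counts)` mirrors the 20-of-132 split from the quote: 20 blocks take the communication branch and the other 112 take the compute branch. A real kernel would replace the tallies with long-running loops for each role.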


yread 506 days ago

So can people do the same in SPIR for OpenCL or amdgcn?

https://en.wikipedia.org/wiki/Standard_Portable_Intermediate...

https://www.khronos.org/spir/

Or, even better, in a unified language like SYCL?

https://cdrdv2-public.intel.com/786536/Heidelberg_IWOCL__SYC...
