by rx_tx
506 days ago
This article has good background, context, and explanations [1]. They skipped CUDA and instead used PTX, a lower-level instruction set, which let them implement more performant cross-chip comms to make up for the less-performant H800 chips.

[1]: https://stratechery.com/2025/deepseek-faq/
You can do this just fine in CUDA, no PTX required. Of course all the major shops are using inline PTX at the very least to access the Tensor cores effectively.
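For anyone who hasn't seen it, here is roughly what "inline PTX" inside an ordinary CUDA kernel looks like. This is only a minimal sketch: it reads the `%laneid` special register, which is far simpler than Tensor Core or cross-chip communication code, and it is not taken from any particular shop's codebase. The names `lane_id_ptx`, `write_lane_ids`, and `lane_ids_via_ptx` are made up for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: read the warp lane index via inline PTX
// instead of computing it from threadIdx.x in CUDA C++.
__device__ unsigned lane_id_ptx() {
    unsigned lane;
    // %laneid is a PTX special register (written %%laneid inside asm strings);
    // "=r" binds a 32-bit register as the output operand.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// Each thread stores the lane index it read through PTX.
__global__ void write_lane_ids(unsigned* out) {
    out[threadIdx.x] = lane_id_ptx();
}

// Hypothetical host helper: launch one 1-D block of n_threads threads and
// copy the lane indices back. Returns an empty vector on any CUDA error.
std::vector<unsigned> lane_ids_via_ptx(int n_threads) {
    std::vector<unsigned> host(n_threads, 0xFFFFFFFFu);
    unsigned* dev = nullptr;
    if (cudaMalloc(&dev, n_threads * sizeof(unsigned)) != cudaSuccess) return {};
    write_lane_ids<<<1, n_threads>>>(dev);
    // cudaMemcpy runs after the kernel on the default stream and reports launch errors.
    cudaError_t err = cudaMemcpy(host.data(), dev, n_threads * sizeof(unsigned),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}

int main() {
    const int n = 64;  // two warps on current NVIDIA hardware (warp size 32)
    std::vector<unsigned> ids = lane_ids_via_ptx(n);
    if (ids.empty()) {
        std::printf("CUDA error (no device?)\n");
        return 1;
    }
    // In a 1-D block the lane index should match threadIdx.x % 32.
    for (int i = 0; i < n; ++i) {
        if (ids[i] != static_cast<unsigned>(i % 32)) {
            std::printf("mismatch at thread %d: lane %u\n", i, ids[i]);
            return 1;
        }
    }
    std::printf("lane ids read via inline PTX match threadIdx.x %% 32\n");
    return 0;
}
```

The point is that the file is still plain CUDA C++ compiled with `nvcc`; the PTX only appears in the one `asm` statement, which is the pattern the comment above describes.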