by saagarjha
510 days ago
> Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA.

You can do this just fine in CUDA, no PTX required. Of course, all the major shops are using inline PTX, at the very least to access the Tensor Cores effectively.
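To make the claim concrete, here is a minimal sketch of role specialization in plain CUDA C++, with no PTX. It is not DeepSeek's code. The kernel name, the constants, and the counters are all hypothetical, and the "communication" and "compute" bodies are placeholders. It assumes one resident block per SM, which CUDA does not guarantee: the hardware scheduler decides where blocks land, so real code usually sizes the grid to the SM count and uses enough resources per block to get one block per SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical split mirroring the numbers in the quote: 20 of 132 blocks
// take the communication role, the remaining 112 do compute.
constexpr int kTotalBlocks = 132;
constexpr int kCommBlocks  = 20;

__global__ void specialized_kernel(int *comm_blocks, int *compute_blocks) {
    if (blockIdx.x < kCommBlocks) {
        // Placeholder for communication work (e.g. staging data for other GPUs).
        if (threadIdx.x == 0) atomicAdd(comm_blocks, 1);
    } else {
        // Placeholder for compute work (e.g. matrix multiplies).
        if (threadIdx.x == 0) atomicAdd(compute_blocks, 1);
    }
}

int main() {
    int *counts = nullptr;
    // Managed memory keeps the sketch short; counts[0] = comm, counts[1] = compute.
    if (cudaMallocManaged(&counts, 2 * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed\n");
        return 1;
    }
    counts[0] = 0;
    counts[1] = 0;

    specialized_kernel<<<kTotalBlocks, 128>>>(&counts[0], &counts[1]);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(counts);
        return 1;
    }

    printf("comm blocks: %d, compute blocks: %d\n", counts[0], counts[1]);
    cudaFree(counts);
    return 0;
}
```

The branch is on `blockIdx.x`, so every thread in a block takes the same path and there is no divergence within a warp. Inline PTX only becomes relevant for instructions that CUDA C++ does not expose directly.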