


	by the_panopticon 258 days ago
	Very interesting. It sounds like tuning at the PTX level can improve workload efficiency. The DeepSeek folks say as much: "Specifically, we employ customized PTX (Parallel Thread Execution) instructions" (https://arxiv.org/abs/2412.19437).

2 comments

shetaye 258 days ago

Agreed! The gulf between pure-C++ CUDA and PTX is getting larger with these optimizations. My understanding is that DeepSeek used PTX instructions that either have no corresponding C++ API (like `wgmma`, mentioned in the article) or use uncommon combinations of modifiers (`ld.global.nc.L1::no_allocate.L2::256B`).


saagarjha 258 days ago

They didn’t employ custom PTX instructions; they used existing ones in ways they were not designed to be used.
