| HN Mirror



	by shetaye 304 days ago
	Agreed! The gulf between pure-C++ CUDA and PTX is getting larger with these optimizations. My understanding is that DeepSeek used PTX instructions that either have no corresponding C++ API (like `wgmma`, mentioned in the article) or use uncommon combinations of modifiers (`ld.global.nc.L1::no_allocate.L2::256B`).
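
	For illustration, here is a minimal sketch of how a load with that modifier string can be reached from CUDA C++ through inline PTX. The helper and kernel names (`ld_nc_no_allocate`, `copy_kernel`) are made up for this example, and it assumes a toolkit and GPU recent enough to accept the `L2::256B` hint (roughly `-arch=sm_80` or newer). It is not taken from DeepSeek's code, just one way such an instruction could be wrapped.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: a 32-bit global load through the non-coherent path,
// with cache hints spelled out in PTX because plain C++ has no syntax for them.
__device__ __forceinline__ int ld_nc_no_allocate(const int* ptr) {
    int value;
    // Convert the generic pointer to a global-space address for ld.global.
    size_t addr = __cvta_generic_to_global(ptr);
    asm volatile("ld.global.nc.L1::no_allocate.L2::256B.s32 %0, [%1];"
                 : "=r"(value)
                 : "l"(addr));
    return value;
}

// Hypothetical kernel: copies `in` to `out` using the helper above.
// `in` must not be written while the kernel runs (required by .nc).
__global__ void copy_kernel(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = ld_nc_no_allocate(in + i);
}

int main() {
    const int n = 256;
    int host_in[n], host_out[n];
    for (int i = 0; i < n; ++i) host_in[i] = i;

    int *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, n * sizeof(int));
    cudaMalloc(&dev_out, n * sizeof(int));
    cudaMemcpy(dev_in, host_in, n * sizeof(int), cudaMemcpyHostToDevice);

    copy_kernel<<<(n + 127) / 128, 128>>>(dev_in, dev_out, n);
    cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Count elements that did not round-trip unchanged.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (host_out[i] != host_in[i]);
    printf("mismatches: %d\n", mismatches);

    cudaFree(dev_in);
    cudaFree(dev_out);
    return mismatches != 0;
}
```

	The point is only that the cache-hint modifiers live inside the `asm` string: the C++ side sees an ordinary function returning an `int`, and the compiler has no way to choose that particular variant on its own.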