| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by zardinality 482 days ago
	"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."

1 comments

magicalhippo 482 days ago

So non-coherent refers to bypassing cache coherency, ie don't care about what other units might have written to that address? And the L1/L2 modifiers are to avoid L1 thrashing, keeping the value in L2 only?

Or did I get that wrong?

link

ta988 482 days ago

My understanding of the L2 part is that it asks for a 256b pre-fetch (only available on some platforms it seems) but they use vectors of 4 32bits signed ints max so not sure why only the 256 would work or if the fact that it did fetch the next 128 helps.

link

saagarjha 482 days ago

Yeah that's about right

link