


	by Bimos 519 days ago
	The PTX instructions they talked about in the tech report should be pointing to the code here, right?

2 comments

zardinality 519 days ago

"For extreme performance, we discover and use a behavior-out-of-doc PTX instruction: ld.global.nc.L1::no_allocate.L2::256B. This instruction will lead to an undefined behavior: accessing volatile GPU memory with non-coherent read-only PTX modifiers .nc. But the correctness is tested to be guaranteed with .L1::no_allocate on Hopper architectures, and performance will be much better. If you find kernels not working on some other platforms, you may add DISABLE_AGGRESSIVE_PTX_INSTRS=1 to setup.py and disable this, or file an issue."
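For anyone who wants to see what that looks like in a kernel, here is a minimal sketch of wrapping the quoted instruction in inline PTX, assuming a compile-time opt-out that mirrors the `DISABLE_AGGRESSIVE_PTX_INSTRS` flag from the quote. The names `LD_NC_FUNC`, `ld_nc_int4`, `copy_kernel` and `copy_roundtrip_ok` are made up for illustration and are not claimed to match the project's code. The source buffer here is never written while the kernel runs, so this only shows the syntax, not the undefined-behavior case the quote warns about.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumption: a macro switch modeled on the flag named in the quote.
// Defining DISABLE_AGGRESSIVE_PTX_INSTRS falls back to a plain volatile load.
#ifndef DISABLE_AGGRESSIVE_PTX_INSTRS
#define LD_NC_FUNC "ld.global.nc.L1::no_allocate.L2::256B"
#else
#define LD_NC_FUNC "ld.volatile.global"
#endif

// Hypothetical helper: loads four 32-bit signed ints (16 bytes) with one
// vectorized instruction. ptr must be 16-byte aligned global memory.
__device__ __forceinline__ int4 ld_nc_int4(const int4* ptr) {
    int4 v;
    asm volatile(LD_NC_FUNC ".v4.s32 {%0, %1, %2, %3}, [%4];"
                 : "=r"(v.x), "=r"(v.y), "=r"(v.z), "=r"(v.w)
                 : "l"(ptr));
    return v;
}

// Each thread copies one int4 element using the helper above.
__global__ void copy_kernel(const int4* src, int4* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = ld_nc_int4(src + i);
}

// Host-side check: copies n elements through the kernel and compares.
// Returns false on any CUDA error or data mismatch.
bool copy_roundtrip_ok(int n) {
    if (n <= 0) return false;
    std::vector<int4> host_in(n), host_out(n);
    for (int i = 0; i < n; ++i) host_in[i] = make_int4(i, -i, 2 * i, 7);

    int4 *src = nullptr, *dst = nullptr;
    size_t bytes = n * sizeof(int4);
    if (cudaMalloc(&src, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dst, bytes) != cudaSuccess) { cudaFree(src); return false; }

    bool ok = cudaMemcpy(src, host_in.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        copy_kernel<<<(n + 255) / 256, 256>>>(src, dst, n);
        // cudaMemcpy synchronizes, so kernel errors surface here.
        ok = cudaMemcpy(host_out.data(), dst, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i) {
        ok = host_out[i].x == host_in[i].x && host_out[i].y == host_in[i].y &&
             host_out[i].z == host_in[i].z && host_out[i].w == host_in[i].w;
    }
    cudaFree(src);
    cudaFree(dst);
    return ok;
}
```

Calling `copy_roundtrip_ok(1024)` from a `main` and building with `nvcc` for a GPU architecture that accepts these qualifiers (the quote only vouches for Hopper) should exercise the load. Adding `-DDISABLE_AGGRESSIVE_PTX_INSTRS` to the compile line selects the fallback in this sketch.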


magicalhippo 519 days ago

So non-coherent refers to bypassing cache coherency, i.e. not caring about what other units might have written to that address? And the L1/L2 modifiers are there to avoid L1 thrashing, keeping the value in L2 only?

Or did I get that wrong?


ta988 519 days ago

My understanding of the L2 part is that it asks for a 256B prefetch (only available on some platforms, it seems), but they use vectors of at most four 32-bit signed ints, so I'm not sure why only the 256 would work, or whether the fact that it fetched the next 128 helps.


saagarjha 519 days ago

Yeah that's about right


helloericsf 519 days ago

this might help: https://x.com/main_horse/status/1894215779521794058/photo/1
