| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gregjm 1118 days ago

Personally I would lean away from radix sort once the problem size gets too large, on Turing they made the per-SM L1 cache coherent and on Ampere they made L2 cache significantly larger so getting maximum throughput requires more actively designing for cache-friendly access patterns. The access pattern of radix sort will be very strided, so you get low cache utility for those longer distance swaps. If your problem fits into shared memory or even registers (for in-warp shuffles) then definitely, radix sort is super efficient!

Memory coalescing probably got a little more lenient than when you were programming. If any group of threads in a warp is accessing a 128B size-and-aligned region of memory, that’s perfectly matched the 128B line size between L1TEX cache and L2 cache and you get coalesced accesses. Even 32B size-and-aligned (e.g. each pair of threads accesses two halves of a 32B region) isn’t all that bad, since it aligns with the 32B L2 cache line size to global memory. You run the risk of getting poor load/store utility if you don’t access the other 96B in that L1TEX line somewhere in the same block, though.

Constant memory is less important than it used to be, since constant memory goes through the texture memory, and the L1 cache was unified and combined with the texture cache. There’s still an LDG.CONSTANT SASS instruction on Turing, though, so there’s still a difference in the hardware that escapes me right now. IIRC you can write a MOV instruction with the source operand in constant memory (rather than LDG.CONSTANT) which doesn’t put load on the load-store unit (LSU). Could be useful if you’re getting warp stalls due to memory pipe throttling.

1 comments

paulmd 1118 days ago

Interesting, love it!

> on Turing they made the per-SM L1 cache coherent and on Ampere they made L2 cache significantly larger

Yeah Turing/Ampere are more than people give them credit for. They are quite a bit more powerful in compute in general iirc, the core is really Volta-descended iirc and not Pascal.

> so getting maximum throughput requires more actively designing for cache-friendly access patterns. The access pattern of radix sort will be very strided, so you get low cache utility for those longer distance swaps. If your problem fits into shared memory or even registers (for in-warp shuffles) then definitely, radix sort is super efficient!

Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol.

Is it just that you need a different sort, or does sorting just not work that well now?

Are prefix-scans still good at all?

Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

(I really wish there was an Agner Fog for GPUs!)

> IIRC you can write a MOV instruction with the source operand in constant memory (rather than LDG.CONSTANT) which doesn’t put load on the load-store unit (LSU).

hackerman.jpg

gregjm 1115 days ago

> Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

I haven’t had great luck with consulting non-primary-sources, most of my knowledge comes from reading NVIDIA blog posts and GTC presentations as they become relevant. Lately I’ve been working with CUTLASS and reading through that documentation — maybe start with their presentations and work back through their references? I’ve learned a lot by reading the architecture tuning guides from NVIDIA, too.

> Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol. > > Is it just that you need a different sort, or does sorting just not work that well now?

I’m not sure, honestly! All I know is that recently I’ve been looking at radix sort kernels with lower-than-expected memory throughout and low cache hit rates :)

> Are prefix-scans still good at all?

The CUB linear-time prefix scan kernels seem to be fantastic still, they operate basically at the speed of DRAM with really high compute utilization. When I’ve seen lower-than-expected performance with these kernels, it’s because of an issue with an inefficient transform being made as part of the input/output iterator ranges, or because of some local memory usage due to an indexing operation in a local variable that couldn’t be fully inlined into registers.