
> Interesting, anything in particular to read about it from Anandtech or Chips+Cheese or what? Or should I just read the whitepapers? (/sigh, reading primary sources, my only weakness)

I haven’t had great luck with non-primary sources; most of my knowledge comes from reading NVIDIA blog posts and GTC presentations as they become relevant. Lately I’ve been working with CUTLASS and reading through its documentation, so maybe start with their presentations and work back through their references? I’ve also learned a lot from NVIDIA’s architecture tuning guides.

> Does dumping warp-sorted data into the buffer help? Eg can each warp or each block sort their output so what they're dumping in can be "galloped", like a partially-sorted mergesort or something? Finding those boundaries between output cells is (ironically) another prefix-scan lol.
>
> Is it just that you need a different sort, or does sorting just not work that well now?

I’m not sure, honestly! All I know is that recently I’ve been looking at radix sort kernels with lower-than-expected memory throughput and low cache hit rates :)
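For anyone following along, the "finding boundaries is another prefix-scan" point from the question can be sketched on the CPU roughly like this. It is only an illustration of the head-flag plus exclusive-scan pattern, not how CUB or any radix sort kernel actually does it; `run_offsets` is a made-up helper name, and I'm assuming the keys are already sorted.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not a CUB API): for keys that are already sorted,
// return the start offset of every run of equal keys, followed by one
// final entry equal to the total number of keys.
std::vector<std::size_t> run_offsets(const std::vector<int>& sorted_keys) {
    const std::size_t n = sorted_keys.size();
    std::vector<std::size_t> head(n), slot(n);

    // Head flag: 1 where a new run of equal keys begins, 0 elsewhere.
    for (std::size_t i = 0; i < n; ++i) {
        head[i] = (i == 0 || sorted_keys[i] != sorted_keys[i - 1]) ? 1 : 0;
    }

    // Exclusive scan of the flags: for a head element, slot[i] is the
    // index of its run (the number of runs that start before it).
    std::exclusive_scan(head.begin(), head.end(), slot.begin(), std::size_t{0});

    // Total number of runs is the last scan value plus the last flag.
    const std::size_t runs = (n == 0) ? 0 : slot[n - 1] + head[n - 1];

    // Scatter each head position into its slot; the extra trailing
    // entry stays at n and marks the end of the last run.
    std::vector<std::size_t> offsets(runs + 1, n);
    for (std::size_t i = 0; i < n; ++i) {
        if (head[i]) {
            offsets[slot[i]] = i;
        }
    }
    return offsets;
}
```

Run `k` then covers the half-open range `[offsets[k], offsets[k + 1])`. On a GPU the flag and scatter loops would be data-parallel and the scan would be a device-wide scan, which is why the boundary search ends up costing about one more scan pass.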

> Are prefix-scans still good at all?

The CUB linear-time prefix scan kernels still seem to be fantastic: they operate basically at the speed of DRAM with really high compute utilization. When I’ve seen lower-than-expected performance from these kernels, it’s been because of an inefficient transform applied as part of the input/output iterator ranges, or because of local memory usage caused by an indexing operation on a local variable that couldn’t be kept entirely in registers.