
Very little has changed in a way fundamental enough to matter, other than _perhaps_ the async global-to-shared-memory copy (a DMA-style memcpy), which avoids tying up register file space as the target buffer for in-flight loads from global memory. Shared memory is, after all, just a partition of L1d$ since, iirc, Volta (when they started offering expanded shared capacity that is requested at launch rather than fixed), so it made sense to provide a not-just-a-hint "prefetch into this user-managed slice of what is otherwise L1d$".

AFAIK it's basically just some special load-like units that ask special L1d$-miss-fill units to deliver to an explicitly specified target location in the non-automatic-cache partition of the local SRAM, and that signal completion in otherwise fairly normal local semaphore/barrier fashion.

The major difference is that the async path has no natural moment to transform or touch the values after the read from global and before the store to shared.
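To make the contrast concrete, here is a minimal sketch (not anyone's production kernel) of the two staging paths. It assumes the `__pipeline_memcpy_async` / `__pipeline_commit` / `__pipeline_wait_prior` primitives from `<cuda_pipeline.h>`; the kernel names, `TILE`, the reversal, and the scale factor are made up for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cuda_pipeline.h>

constexpr int TILE = 256;  // elements per block (hypothetical size)

// Path A: global -> register -> shared. The value sits in a register while the
// load is in flight, which also gives a natural place to transform it.
__global__ void stage_via_registers(const float* in, float* out, float scale) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    float v = in[i];                  // register is the landing buffer
    tile[threadIdx.x] = v * scale;    // transform on the way into shared
    __syncthreads();
    out[i] = tile[TILE - 1 - threadIdx.x];
}

// Path B: global -> shared directly. No register holds the data in flight,
// so any transform has to happen after the copy, when reading from shared.
__global__ void stage_via_async_copy(const float* in, float* out, float scale) {
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    __pipeline_memcpy_async(&tile[threadIdx.x], &in[i], sizeof(float));
    __pipeline_commit();              // close this batch of async copies
    __pipeline_wait_prior(0);         // wait for this thread's copies to land
    __syncthreads();                  // make every thread's copy visible
    out[i] = tile[TILE - 1 - threadIdx.x] * scale;
}

// Host-side check: both paths should produce the same block-reversed, scaled data.
bool run_staging_demo() {
    const int blocks = 4, n = blocks * TILE;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    stage_via_registers<<<blocks, TILE>>>(d_in, d_out, 2.0f);
    cudaMemcpy(h_a.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    stage_via_async_copy<<<blocks, TILE>>>(d_in, d_out, 2.0f);
    cudaMemcpy(h_b.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n && ok; ++i) {
        int base = (i / TILE) * TILE, t = i % TILE;
        float expected = h_in[base + TILE - 1 - t] * 2.0f;
        ok = (h_a[i] == expected) && (h_b[i] == expected);
    }
    return ok;
}
```

Calling `run_staging_demo()` from a `main` should return `true` on a working device. Note where the multiply by `scale` ends up in path B: after the data is already in shared, since nothing touches it on the way in.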

Otherwise, tiled MMA (GEMM) kernels were normal even in the Maxwell days (after the classic K80, before the P100; Maxwell is when H.265 support landed).
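For reference, the classic shared-memory tiling pattern looks roughly like the sketch below. It is the textbook version, not a tuned kernel; the names `tiled_gemm`, `T`, and `run_gemm_demo` are made up, and it assumes square row-major matrices with `n` a multiple of `T`.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int T = 16;  // tile edge (hypothetical choice)

// C = A * B for n x n row-major matrices; assumes n is a multiple of T.
__global__ void tiled_gemm(const float* A, const float* B, float* C, int n) {
    __shared__ float As[T][T];
    __shared__ float Bs[T][T];
    int row = blockIdx.y * T + threadIdx.y;
    int col = blockIdx.x * T + threadIdx.x;
    float acc = 0.0f;
    for (int k0 = 0; k0 < n; k0 += T) {
        // Each thread stages one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + k0 + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
        __syncthreads();              // tiles fully loaded
        for (int k = 0; k < T; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();              // done reading before the next overwrite
    }
    C[row * n + col] = acc;
}

// Host-side check against a plain CPU triple loop, using small integer values
// so the float results are exact.
bool run_gemm_demo() {
    const int n = 64;
    const size_t bytes = size_t(n) * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hC(n * n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            hA[i * n + j] = float((i + j) % 7);
            hB[i * n + j] = float((i * j) % 5);
        }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(T, T), grid(n / T, n / T);
    tiled_gemm<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    for (int i = 0; i < n && ok; ++i)
        for (int j = 0; j < n && ok; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[i * n + k] * hB[k * n + j];
            ok = (hC[i * n + j] == ref);
        }
    return ok;
}
```

The two staging stores at the top of the `k0` loop are exactly the global-to-register-to-shared hop that the async copy replaces; the rest of the structure is unchanged.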