by siboehm 1264 days ago
Author here: Seems like a good trick! Though won't this affect shared memory alignment and make me lose those LDS.128 instructions? Or do these not require alignment? There are so few good docs on SASS.

In general I'm still confused about whether vectorized load instructions (LDS.128) necessarily lead to bank conflicts or not. My impression was that consecutive 32b floats get mapped to different banks, so to avoid conflicts I'd want the warp to load 32*32b consecutive elements at each step.
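
To make that mental model concrete, here is a rough host-side sketch of the bank mapping for plain 32-bit accesses. It assumes the usual layout of 32 banks that are each 4 bytes wide, so word i sits in bank i % 32. The names `bank_of` and `max_conflict_degree` are made up for illustration, and the model says nothing about how the hardware splits an LDS.128 across the warp.

```cpp
// Assumption: 32 banks, each 4 bytes wide, so 32-bit word i lives in bank i % 32.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical helper: bank that holds the 32-bit word at word_index.
constexpr int bank_of(int word_index) { return word_index % kNumBanks; }

// Largest number of threads in one warp that land on the same bank when
// thread t reads the 32-bit word at index t * stride_words (stride >= 1,
// so every thread reads a distinct word). 1 means conflict-free,
// n means an n-way bank conflict.
constexpr int max_conflict_degree(int stride_words) {
    int hits[kNumBanks] = {};
    for (int t = 0; t < kWarpSize; ++t) {
        ++hits[bank_of(t * stride_words)];
    }
    int worst = 0;
    for (int b = 0; b < kNumBanks; ++b) {
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}

// 32 consecutive floats touch 32 different banks.
static_assert(max_conflict_degree(1) == 1, "stride 1 is conflict-free");
// A stride of 32 floats puts every thread on bank 0.
static_assert(max_conflict_degree(32) == 32, "stride 32 is a 32-way conflict");
// Padding the stride by one float spreads the threads out again.
static_assert(max_conflict_degree(33) == 1, "stride 33 is conflict-free");
```

Under this model a warp reading 32 consecutive floats is conflict-free, which matches the impression above. Whether the same holds for 128-bit loads depends on how the hardware groups threads for wide accesses, which this sketch deliberately leaves out.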

1 comment

Hmm, I think you might have to adjust the padding to be 128 bits then:

    __shared__ float As[(CHUNKSIZE+4) * CHUNKSIZE];
Ultimately it's down to trial and error, like always with GPGPU.
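
For what it's worth, the alignment side of this can be checked with plain arithmetic. The sketch below assumes a 128-bit access needs a 16-byte-aligned address, that the array base itself is 16-byte aligned, and that each row is CHUNKSIZE floats plus some padding. The value 32 for CHUNKSIZE and the one-float padding used for comparison are illustrative assumptions, and the helper names are made up.

```cpp
// Assumptions: 4-byte floats, 128-bit (16-byte) vector accesses that need
// 16-byte alignment, and an array base that is itself 16-byte aligned.
constexpr int kFloatBytes = 4;
constexpr int kVectorBytes = 16;

// Hypothetical helper: byte offset of the first element of `row` when each
// row holds `chunk` floats followed by `pad` floats of padding.
constexpr int row_offset_bytes(int row, int chunk, int pad) {
    return row * (chunk + pad) * kFloatBytes;
}

// True if every one of the `chunk` rows starts on a 16-byte boundary
// relative to the array base.
constexpr bool rows_stay_aligned(int chunk, int pad) {
    for (int row = 0; row < chunk; ++row) {
        if (row_offset_bytes(row, chunk, pad) % kVectorBytes != 0) return false;
    }
    return true;
}

// With one float of padding, row 1 starts at byte 132, which is not a
// multiple of 16.
static_assert(row_offset_bytes(1, 32, 1) == 132, "row stride is 33 floats");
static_assert(!rows_stay_aligned(32, 1), "1-float padding breaks 16-byte alignment");

// With four floats (128 bits) of padding, the row stride is 144 bytes,
// a multiple of 16, so every row start stays aligned.
static_assert(row_offset_bytes(1, 32, 4) == 144, "row stride is 36 floats");
static_assert(rows_stay_aligned(32, 4), "4-float padding keeps 16-byte alignment");
```

This only covers alignment. Whether the 4-float padding also avoids bank conflicts depends on the access pattern of the kernel, so it is still worth profiling.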