Hacker News new | ask | show | jobs
by jhawthorn 5171 days ago
Oh, no! Missed noticing my first link from HN for a few days. Hopefully I can clarify some of this.

First, this is in no way faster than cuSPARSE or cusp. I wrote this for an originally for a school assignment (hence older cuda version) and was hoping to convey what I had learned my first time using cuda.

The size of the shared memory is so the reduction has no buffer overflow without using conditionals. However I am using more than needed, it should be set to 32+16. I don't expect this to affect performance as the kernel already reaches 100% theoretical occupancy.

Could you explain the desire for more threads/blocks per row? I can't immediately think of one, having more additions per thread sounds good so long as all are running.

Thanks very much for the read and the reply!