|
|
|
|
|
by jhawthorn
5171 days ago
|
|
Oh, no! Missed noticing my first link from HN for a few days. Hopefully I can clarify some of this. First, this is in no way faster than cuSPARSE or cusp. I wrote this for an originally for a school assignment (hence older cuda version) and was hoping to convey what I had learned my first time using cuda. The size of the shared memory is so the reduction has no buffer overflow without using conditionals. However I am using more than needed, it should be set to 32+16. I don't expect this to affect performance as the kernel already reaches 100% theoretical occupancy. Could you explain the desire for more threads/blocks per row? I can't immediately think of one, having more additions per thread sounds good so long as all are running. Thanks very much for the read and the reply! |
|