by jhokanson
1591 days ago
It is not exactly clear to me what is going on with threads (I guess you are using all of them?). I haven't done much in this space, but anecdotally I've had better luck when the summation is explicitly split into sub-summation tasks. It is not clear whether that is being done here. It looks like a single summation loop that the author expects the computer to magically split across multiple threads. I'd be interested in seeing what this looks like if each thread instead summed its own chunk of the original dataset into a per-thread result (e.g., first 8000 samples on the first thread, next 8000 on the 2nd thread, etc.), with a final accumulation loop across all threads. Again, the author may already be doing this, and this is not my area of expertise, but I've had decent luck saturating the memory bus with a similar approach.
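For concreteness, the chunk-per-thread scheme described above could look roughly like the sketch below. This is only an illustration of the idea in the comment, not the article's code: the function name `chunked_sum`, the `float` input type, and the `double` accumulators are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: each thread sums one contiguous chunk of `data`,
// then a final loop accumulates the per-thread partial sums.
double chunked_sum(const std::vector<float>& data, std::size_t num_threads) {
    assert(num_threads > 0);

    std::vector<double> partial(num_threads, 0.0);  // one result slot per thread
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    // Chunk size rounded up, so the last chunks may be short or empty.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            // Accumulate locally and store once, so threads do not keep
            // writing to neighbouring slots of `partial` (false sharing).
            double local = 0.0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }

    for (auto& worker : workers) worker.join();

    // Final accumulation across all threads.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling `chunked_sum(samples, std::thread::hardware_concurrency())` would use one chunk per hardware thread, provided that call returns a non-zero value. Whether this actually saturates the memory bus depends on the machine, so treat it as a starting point for measurement rather than a tuned implementation.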
Of course we don't expect the compiler to instantiate the threads for us; this isn't OpenMP :) We covered that one in previous articles. OpenMP gave us about 50 GB/s with all cores enabled and 80 GB/s with some of them disabled.