by nkurz
4155 days ago
I tried the last option I mentioned (unconditional sequential writes and random reads) and got approximately the results I expected on Haswell. I'm managing one read/write combo about every 8 cycles, which I think is close to the maximum throughput that Little's law allows for random reads from RAM.

Somewhat as expected, Karim's approach was faster when the input array was very sparse. Unexpectedly, a basic approach with a simple conditional performed better for me on Haswell than Karim's. It's possible I damaged the code when I sprinkled it with 'unsigned' to quiet the warnings about signed/unsigned comparisons. I didn't find any benefit from prefetching for code that was already fast, and I wasn't up to trying the vectorized approach.

For reasons I don't understand, performance for just about everything I tried was abysmal on Sandy Bridge. In many cases I saw 3-5x better absolute performance on Haswell, for both small and large arrays.

I put my test code up here: https://gist.github.com/nkurz/43bd7754155d63381758
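
The Little's law remark can be made concrete with a little arithmetic. Little's law says that the number of requests in flight equals throughput times latency, so the best-case interval between completed memory accesses is the latency divided by the number of requests the core can keep outstanding. The sketch below is only an illustration: the function names are made up for this example, and the 200-cycle latency is an arbitrary placeholder, not a measured Haswell figure. Only the 8 cycles per access comes from the comment.

```python
def cycles_per_access(latency_cycles: float, requests_in_flight: float) -> float:
    """Best-case cycles between completed accesses under Little's law.

    in_flight = throughput * latency, so 1 / throughput = latency / in_flight.
    """
    if requests_in_flight <= 0:
        raise ValueError("requests_in_flight must be positive")
    return latency_cycles / requests_in_flight


def implied_requests_in_flight(latency_cycles: float,
                               observed_cycles_per_access: float) -> float:
    """Concurrency needed to sustain the observed rate at a given latency."""
    if observed_cycles_per_access <= 0:
        raise ValueError("observed_cycles_per_access must be positive")
    return latency_cycles / observed_cycles_per_access


if __name__ == "__main__":
    # ASSUMPTION: placeholder RAM latency in cycles, chosen only for illustration.
    assumed_latency_cycles = 200.0
    # From the comment: one read/write combo about every 8 cycles.
    observed = 8.0
    in_flight = implied_requests_in_flight(assumed_latency_cycles, observed)
    print(f"implied requests in flight: {in_flight}")
```

Plugging in a measured latency for a given machine would show how many outstanding misses the observed rate implies, which is one way to check whether the code is near the limit of memory-level parallelism.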