UPDATE: To my surprise, and after much fiddling, I did not manage to write a version that was measurably faster than the hand-written sum_avx512 shown below (indeed, every attempt was at least a percent slower). There is almost certainly something that I am doing wrong, but I can't seem to figure out what it is. I will take this opportunity to leave this as an exercise for the reader :).

UPDATE: See https://www.realworldtech.com/forum/?threadid=200693&curpost... for a dramatic simplification. Not catching this was an oversight on my part. I will update this post to include numbers for the strategy described there.