by camel-cdr 702 days ago
Couldn't you organize the accumulators in 8-byte chunks and leave the upper byte of each chunk unused? Then you map consecutive digits to those chunks and use 64-bit addition for the accumulation. Carries between the bytes would still give the correct result if you do the shuffles correctly, and you'd have a full byte of overflow headroom.
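One possible reading of this idea, sketched in scalar C rather than with vector registers: treat each 64-bit word as a 56-bit limb (seven data bytes) plus one spare top byte, add limbs with plain 64-bit additions, and only propagate carries between words once at the end. The names `LIMB_BITS`, `accumulate`, and `normalize` are made up for illustration, and the byte layout is an assumption, not something stated in the comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 7 data bytes per 64-bit word, top byte kept free. */
#define LIMB_BITS 56
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Add n limbs of src into acc using plain 64-bit adds.
   Carries between the bytes of one word are handled by the adder itself;
   carries out of the 56 data bits pile up in the spare top byte. */
static void accumulate(uint64_t *acc, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(src[i] <= LIMB_MASK); /* input limbs must leave the top byte clear */
        acc[i] += src[i];
    }
}

/* Fold the deferred carries from each word into the next one.
   Returns the carry out of the highest limb. */
static uint64_t normalize(uint64_t *acc, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t v = acc[i] + carry;
        acc[i] = v & LIMB_MASK;
        carry = v >> LIMB_BITS;
    }
    return carry;
}
```

Starting from a zeroed accumulator, up to 256 inputs can be summed this way before `normalize` has to run, since 256 * (2^56 - 1) still fits in 64 bits.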

Gaps in the numbers are often enough to do some kind of "SIMD" even on ordinary 32-bit processors.
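A minimal illustration of that kind of "SIMD within a register", assuming four 7-bit values packed one per byte into a 32-bit word so that the top bit of each byte acts as the gap. The helper names `pack4`, `add4`, and `lane` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four values below 128 into one 32-bit word, one per byte.
   The unused top bit of each byte is the gap that absorbs a carry. */
static uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    assert(a < 128 && b < 128 && c < 128 && d < 128);
    return (uint32_t)a | (uint32_t)b << 8 | (uint32_t)c << 16 | (uint32_t)d << 24;
}

/* One ordinary 32-bit add sums all four lanes at once.
   Each lane sum is at most 254, so nothing spills into the next lane. */
static uint32_t add4(uint32_t x, uint32_t y)
{
    return x + y;
}

/* Extract lane i (0 = lowest byte). */
static uint8_t lane(uint32_t w, unsigned i)
{
    return (uint8_t)(w >> (8 * i));
}
```

After one such add the gap bits may be in use, so a further add needs the lanes widened or reduced first.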
Yeah, but I was thinking of doing this within the vector registers to increase the batch size.