by xoranth
1072 days ago
That's likely the fastest way to do that without vectorization. But you'd need to upcast 's' to a uint64 (or at least a uint32).
That means that vectorization would operate on 32/64-bit lanes. With vectorization, I think the way to go is to have two nested loops: an outer one that advances by 32 * 255 elements at a time, and an inner one that loads 32 bytes, compares each character to 's', and accumulates on 8-bit lanes. An 8-bit lane can take at most 255 increments before it overflows, which is where the 255 comes from. Then in the outer loop you do a horizontal sum of the 8-bit accumulators into a wider total.
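To make the loop structure concrete, here is a rough sketch in portable C rather than real SIMD intrinsics. The name `count_byte` and the constants `LANES` and `MAX_INNER` are made up for illustration, and the 32-element `acc` array only stands in for a 32-byte vector register; whether a compiler turns the inner loop into actual vector instructions depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants: 32 byte-wide lanes, 255 inner iterations per block. */
enum { LANES = 32, MAX_INNER = 255 };

/* Counts how many bytes in buf[0..len) equal `target`. */
uint64_t count_byte(const unsigned char *buf, size_t len, unsigned char target)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Outer loop: each pass consumes at most 32 * 255 bytes. */
    while (len - i >= LANES) {
        uint8_t acc[LANES] = {0};   /* stand-in for one vector of 8-bit lanes */
        size_t chunks = (len - i) / LANES;
        if (chunks > MAX_INNER)
            chunks = MAX_INNER;     /* a lane can hold at most 255 hits */

        /* Inner loop: take 32 bytes, compare each to target, add 0 or 1 per lane. */
        for (size_t c = 0; c < chunks; c++) {
            for (size_t l = 0; l < LANES; l++)
                acc[l] += (uint8_t)(buf[i + l] == target);
            i += LANES;
        }

        /* Horizontal sum of the 8-bit lanes into the 64-bit total. */
        for (size_t l = 0; l < LANES; l++)
            total += acc[l];
    }

    /* Scalar tail for the last (len % 32) bytes. */
    for (; i < len; i++)
        total += (buf[i] == target);

    return total;
}
```

The point of the split is that the hot inner loop only ever touches 8-bit values, so a 32-byte vector handles 32 characters per step, and the widening happens once per block instead of once per character.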