Hacker News new | ask | show | jobs
by simonask 26 days ago
Strings typically consist of UTF-8 bytes, and any old `char*` pair has no alignment guarantees.
1 comments

That's true, and that's why your typical string vector code has a prelude and a postlude to do the incomplete chunks at the ends. Between the ends, it's processing larger self-aligned chunks.
If you're aware of that technique, why were you asking about use cases for unaligned loads?
I'm saying that string search algorithms are _not_ a legitimate use case for unaligned loads.
You didn’t really say that, but feel free to share any reasons you might have to think so.

I don’t see any reason why it wouldn’t be perfectly fine on recent hardware, where unaligned loads are just as fast, and the cache pressure is identical for a linear search algorithm.

I asked where is the part about unaligned pointers in your string processing example. Saying that you want to load multiple bytes at a time does not imply at all that you have to do unaligned loads.

Doing unaligned loads using SSE or AVX might have been possible on Intel architectures for a long time, but it is still a little bit slower afaik. But anyway when you get into sub-architecture specific details like that, you've essentially left C-land, and you're essentially doing assembler level programming.

Every vectorized string search algorithm (including those treating an unsigned long as a "vector" of 8 bytes) currently needs a prelude that performs the search up to the first alignment boundary, and then performs the bulk of the search on well-aligned blocks, and then finally a postlude search in the tail of the string, where the tail is shorter than the block size/alignment.

Using unaligned loads, you can get rid of the prelude, including the associated branches and intptr arithmetic, and just have to deal with the tail.

If you're comparing short-ish strings, almost all of the time is spent in the prelude and postlude, even if the entire substring fits in a register. This is a silly language limitation when the hardware can actually easily just support the unaligned load.

In particular, it doesn't seem justified that what at most amounts to a tiny inefficiency in hardware turns into a very expensive class of bugs (UB).