That's true, and that's why your typical string vector code has a prelude and a postlude to do the incomplete chunks at the ends. Between the ends, it's processing larger self-aligned chunks.
You didn’t really say that, but feel free to share any reasons you might have to think so.
I don’t see any reason why it wouldn’t be perfectly fine on recent hardware, where unaligned loads are just as fast, and the cache pressure is identical for a linear search algorithm.