I read it. I don't see your point. Sentinel values simply give you a very low performance ceiling that you can't optimise past. What I wrote only took me a couple minutes. It's far from optimal (but it already takes 2/3 the time the sentinel version takes, and 1/2 the time Lemire's range version takes). With thorough effort I bet you could get at least 10 times faster than that, provided you are given the length. With sentinels, you're just left with the original performance levels you can't improve.
Results:
(ratio is range/sentinel, not mine/whatever)You can get even more crazy with SIMD. But for all that you need to know the length beforehand.
Edit: The b4 variable should actually be called b8. That's a remnant of a previous version, where I used 32-bit chunks.