by aeruder 2955 days ago
The trick here is that each byte is treated as a signed 8-bit number. When the top bit is set, the number is negative.
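A minimal scalar sketch of that idea, assuming a plain byte buffer with an explicit length. The function name `is_ascii_signed` is made up for illustration and is not taken from the linked code. Reading each byte as an `int8_t` makes every value from 0x80 to 0xFF negative, so a single "less than zero" comparison detects non-ASCII bytes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if every byte in buf is ASCII (0x00..0x7F). */
static bool is_ascii_signed(const void *buf, size_t len) {
    /* View the bytes as signed 8-bit integers. */
    const int8_t *p = (const int8_t *)buf;
    for (size_t i = 0; i < len; i++) {
        /* Top bit set means the signed value is negative, i.e. non-ASCII. */
        if (p[i] < 0) {
            return false;
        }
    }
    return true;
}
```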

Oh duh, thanks. (Checking less than zero, read it wrong.)

I think it would be faster to OR the entire string with itself, then finally check the 8th bit though. On Skylake that would cut it to 0.33 cycles per 16 bytes (Haswell: 1 cycle per 16 bytes).
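A scalar sketch of the OR-accumulate idea, assuming an explicit length. `is_ascii_or` is an illustrative name, not the code from the pull request. A SIMD version would presumably do the same thing with 16-byte vectors; this version only shows the logic and says nothing about the cycle counts quoted above:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: OR all bytes together, test the top bit once at the end. */
static bool is_ascii_or(const char *s, size_t len) {
    unsigned char acc = 0;
    for (size_t i = 0; i < len; i++) {
        /* No branch in the loop body: just accumulate every set bit. */
        acc |= (unsigned char)s[i];
    }
    /* If any byte had bit 7 set, the accumulator has it set too. */
    return (acc & 0x80) == 0;
}
```

Note that this always reads the whole input, even when the first byte is already non-ASCII.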

https://github.com/lemire/fastvalidate-utf-8/pull/2

Depends on your input. If non-ASCII strings are frequent and likely to contain a non-ASCII character fairly close to the start of the string, then it makes sense to short circuit.
The previous >0 algorithm didn't short-circuit. There is no change to short-circuit behavior here.
Ah, I was thinking of the naive implementation in the previous post [1].

[1] https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-...
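One possible middle ground for the short-circuit question, sketched under assumptions: OR-accumulate within fixed-size blocks and test the top bit once per block, so the inner loop stays branch-free while a non-ASCII byte near the start still ends the scan early. The name `is_ascii_blocked` and the block size of 64 are arbitrary choices for illustration, not something from the thread or the linked post:

```c
#include <stdbool.h>
#include <stddef.h>

#define ASCII_BLOCK 64 /* assumed block size; tune for the target machine */

/* Hypothetical helper: branch-free OR inside each block, early exit between blocks. */
static bool is_ascii_blocked(const char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        /* End of the current block, clamped to the end of the input. */
        size_t end = (len - i < ASCII_BLOCK) ? len : i + ASCII_BLOCK;
        unsigned char acc = 0;
        for (; i < end; i++) {
            acc |= (unsigned char)s[i];
        }
        /* One check per block: stop as soon as a block contains a high bit. */
        if (acc & 0x80) {
            return false;
        }
    }
    return true;
}
```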

> I think it would be faster to OR the entire string with itself, then finally check the 8th bit though.

The string could have NUL (zero) bytes in between.

You're right that it changes the behavior vs. what the current implementation is, but 0x0 is a valid ASCII character.
While you're technically right that NUL is part of the ASCII set, in practice it's rarely wanted in the data.
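If NUL bytes should be rejected as well, the OR accumulator alone cannot tell, because ORing in a zero byte changes nothing. A rough sketch of one way to handle it, assuming an explicit length and using the made-up name `is_ascii_no_nul`, is to carry a second branch-free flag:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true only if every byte is in 0x01..0x7F. */
static bool is_ascii_no_nul(const char *s, size_t len) {
    unsigned char acc = 0;      /* collects the set bits of every byte */
    unsigned char saw_nul = 0;  /* becomes 1 once any byte equals zero */
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)s[i];
        acc |= b;
        saw_nul |= (unsigned char)(b == 0);
    }
    return (acc & 0x80) == 0 && !saw_nul;
}
```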