Oh duh, thanks. (It's checking for less than zero; I read it wrong.)
I think it would be faster to OR all the 16-byte blocks of the string together into one accumulator and only check the 8th (high) bit of each byte once at the end, though. On Skylake that would cut it to 0.33 cycles per 16 bytes (Haswell: 1 cycle per 16 bytes).
Depends on your input. If non-ASCII strings are frequent and likely to contain a non-ASCII character fairly close to the start of the string, then it makes sense to short circuit.
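For reference, here is a rough sketch of the two variants being compared, not the code from the PR. The function names `is_ascii_accumulate` and `is_ascii_early_exit` are made up for illustration, and the SSE2 path is only compiled when `__SSE2__` is defined (otherwise a plain byte loop does all the work). The first ORs every block into an accumulator and tests the high bits once; the second checks each block and bails out at the first non-ASCII byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: OR everything together, test the high bits once. */
static inline bool is_ascii_accumulate(const uint8_t *src, size_t len) {
    size_t i = 0;
    int high = 0;       /* one bit per byte lane, set if that lane saw bit 7 */
    uint8_t tail = 0;   /* OR of the bytes not covered by full 16-byte blocks */
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 16 <= len; i += 16) {
        /* No branch in the loop: just fold each block into the accumulator. */
        acc = _mm_or_si128(acc, _mm_loadu_si128((const __m128i *)(src + i)));
    }
    /* movemask collects bit 7 of each of the 16 bytes into an int. */
    high = _mm_movemask_epi8(acc);
#endif
    for (; i < len; i++) {
        tail |= src[i];
    }
    return high == 0 && (tail & 0x80) == 0;
}

/* Hypothetical helper: same check, but stop at the first non-ASCII block. */
static inline bool is_ascii_early_exit(const uint8_t *src, size_t len) {
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= len; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(src + i));
        if (_mm_movemask_epi8(block) != 0) {
            return false; /* short circuit: some byte has its high bit set */
        }
    }
#endif
    for (; i < len; i++) {
        if (src[i] & 0x80) {
            return false;
        }
    }
    return true;
}
```

Both return the same answer for any input; they only differ in how much of the string they read when a non-ASCII byte shows up early. Note that a single accumulator chains every OR onto the previous one, so getting near the throughput quoted above would probably need a few independent accumulators merged at the end.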
https://github.com/lemire/fastvalidate-utf-8/pull/2