by StopDisinfo910
165 days ago
Not really, no. The main difference is that you don't know how many code points fit in the prefix: UTF-8 is a variable-length encoding, so four bytes can hold as many as four code points or as few as one. I imagine the choice of four bytes for the prefix was made specifically for this reason, since that's the maximum length of a single code point encoded in UTF-8. The length is no longer the number of characters, just the size of the string in bytes. Apart from that, it should work exactly the same.
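
To make the "up to four, as few as one" point concrete, here is a rough Python sketch. The function name `code_points_in_prefix` and the default `prefix_len=4` are my own choices for illustration, not taken from any particular implementation. It counts how many code points begin inside the first four bytes of a string's UTF-8 encoding, including one that may be cut off at the end.

```python
def code_points_in_prefix(s: str, prefix_len: int = 4) -> int:
    """Count code points that start within the first prefix_len UTF-8 bytes of s.

    prefix_len=4 is an assumption taken from the discussion, not a fixed rule.
    """
    prefix = s.encode("utf-8")[:prefix_len]
    # UTF-8 continuation bytes look like 0b10xxxxxx; any other byte
    # is the first byte of a code point.
    return sum(1 for b in prefix if (b & 0xC0) != 0x80)


if __name__ == "__main__":
    for sample in ["abcd", "héllo", "日本語", "😀abc"]:
        print(sample, code_points_in_prefix(sample))
```

For an all-ASCII string like "abcd" the count is 4, while a string that starts with a four-byte emoji such as "😀abc" gives 1.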
|
Also, for UTF-8 specifically, cutting code points in half is fine as long as all strings are valid UTF-8. The UTF-8 encoding is prefix-free, i.e., no valid encoded code point is a prefix of another valid encoded code point, so for prefix matching we can usually just compare bytes.
It only gets more complicated if you add collations or want to match case-insensitively. But at that point you need to take into account all edge cases of the Unicode spec anyway.
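
A minimal Python sketch of what I mean, assuming a 4-byte stored prefix as discussed above. The names `PREFIX_LEN`, `stored_prefix`, `could_match` and `starts_with` are hypothetical helpers for illustration only. The stored prefix may end in the middle of a code point, but since both sides are valid UTF-8, plain byte comparison still gives the right answer.

```python
PREFIX_LEN = 4  # assumed prefix size, taken from the discussion above


def stored_prefix(s: str) -> bytes:
    """First PREFIX_LEN bytes of the UTF-8 encoding; may end mid code point."""
    return s.encode("utf-8")[:PREFIX_LEN]


def could_match(prefix: bytes, needle: str) -> bool:
    """Cheap filter on the stored prefix only.

    False means the string cannot start with needle.
    True means the full string still has to be checked.
    """
    return prefix.startswith(needle.encode("utf-8")[:PREFIX_LEN])


def starts_with(s: str, needle: str) -> bool:
    """Full byte-wise prefix test on the UTF-8 encodings."""
    return s.encode("utf-8").startswith(needle.encode("utf-8"))


if __name__ == "__main__":
    s = "日本語"
    p = stored_prefix(s)  # three bytes of 日 plus the first byte of 本
    for needle in ["日本", "日月", "月"]:
        print(needle, could_match(p, needle), starts_with(s, needle))
```

Note that "日月" passes the prefix filter because 本 and 月 share the same first byte, so the prefix is only a way to reject candidates early; the full comparison still decides.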