|
|
|
|
|
by imron
3371 days ago
|
|
> UTF-8 strings can have zero bytes in them This is not true. A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends. What you do need to look out for is malformed utf-8, for example, 1 byte before the null terminator you get a lead byte saying the next character is 4-bytes long. If you're not checking each byte for null and just skipping based on the length indicated by the lead byte then you're in for a crash. Where utf-8 strings differ from C strings is slicing. You can't just slice the string at some random point without doing extra validation to make sure you only slice on codepoint boundaries. |
|
No, the parent was correct: UTF-8 encodes NUL (i.e. \0) as a single zero byte (e.g. in contrast, Modified UTF-8[1] uses an overlong for NUL, so there's never any possibility of an internal zero). Of course, an application/library can choose to restrict itself to only handling UTF-8 that doesn't contain internal NULs, but the spec itself allows for zero bytes in a string.
[1]: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8