Hacker News new | ask | show | jobs
by dbaupp 3376 days ago
What do you mean by "character"? If you mean code point or "unicode scalar value", sure, but if you mean user-visible character (grapheme), it's much more complicated: even something "simple" like รถ could be one or two code points.
1 comments

I mean your iterator is char* and you advance it by adding. That's it.

I do NOT mean that char itself corresponds to a glyph or codepoint, you are seriously preaching to the choir making that lecture to me.

>you advance it by adding

And when do you stop? UTF-8 strings can have zero bytes in them so treating them as C strings is potentially error prone depending on the context.

> UTF-8 strings can have zero bytes in them

This is not true. A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

What you do need to look out for is malformed utf-8, for example, 1 byte before the null terminator you get a lead byte saying the next character is 4-bytes long.

If you're not checking each byte for null and just skipping based on the length indicated by the lead byte then you're in for a crash.

Where utf-8 strings differ from C strings is slicing. You can't just slice the string at some random point without doing extra validation to make sure you only slice on codepoint boundaries.

> A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

No, the parent was correct: UTF-8 encodes NUL (i.e. \0) as a single zero byte (e.g. in contrast, Modified UTF-8[1] uses an overlong for NUL, so there's never any possibility of an internal zero). Of course, an application/library can choose to restrict itself to only handling UTF-8 that doesn't contain internal NULs, but the spec itself allows for zero bytes in a string.

[1]: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

We are in agreement that the only time a single zero byte can be found in well-formed utf-8 is for the NUL character.

By definition, with a null-terminated string, NUL is the terminator.

If you want to have strings that contain NUL, then by definition you can't use a null-terminated string.

This is true of utf-8 or regular C strings.

The point is, if you handle strings the C way, you're not in conformance with UTF-8.

If someone passes you a text file that is verified to be valid UTF-8 and contains, say, access permissions, then you better not stop parsing it at the first '\0' character.

None of this is a huge problem, but it's something to be aware of. C string handling is incompatible with UTF-8.

Unless you have U+0000 there isn't any other sequence of code points that has an 0x00 byte in UTF-8. I don't see this as a huge problem.

If you really do need it there are some C language libraries that use "pascal-ish" structs to do strings. UNICODE_STRING in Windows comes to mind. Doing strings in C doesn't force you to use C strings, it's just the most common thing to do.

No it's not a huge problem, but if you're not aware of it, it could easily lead to a security breach: https://news.ycombinator.com/item?id=13974919
It's the same for ASCII - UTF-8 zero byte is NUL.