by asveikau 3375 days ago
Funny.

You can parse utf-8 a character at a time. Some characters advance the pointer by 4 bytes in an iteration and some by fewer.


You can, and then you get a lead byte announcing a 4-byte character 1 byte before the end of your data, you skip over the null terminator and into the stack, and bang.

Yes, you can avoid this if you're careful and you understand the intricacies of utf-8 (or some other multi-byte encoding), but it very quickly stops being elegant.

What do you mean by "character"? If you mean code point or "unicode scalar value", sure, but if you mean user-visible character (grapheme), it's much more complicated: even something "simple" like ö could be one or two code points.
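To make the ö point concrete, here is a minimal sketch in C. The helper name utf8_codepoint_count and the two constants are made up for illustration, and the counter assumes well-formed, null-terminated UTF-8. It counts code points by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points by counting every byte that is
 * NOT a continuation byte (10xxxxxx). Assumes well-formed UTF-8. */
size_t utf8_codepoint_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Precomposed form: U+00F6, encoded as C3 B6. */
static const char O_UMLAUT_PRECOMPOSED[] = "\xC3\xB6";

/* Decomposed form: 'o' followed by U+0308 (combining diaeresis, CC 88). */
static const char O_UMLAUT_DECOMPOSED[] = "o\xCC\x88";
```

Both strings render as the same user-visible character, but the first is 2 bytes and 1 code point while the second is 3 bytes and 2 code points. Grouping code points into graphemes needs the Unicode segmentation rules, which this sketch does not attempt.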
I mean your iterator is char* and you advance it by adding. That's it.

I do NOT mean that a char itself corresponds to a glyph or code point. You are seriously preaching to the choir by giving me that lecture.

> you advance it by adding

And when do you stop? UTF-8 strings can have zero bytes in them, so treating them as C strings is potentially error-prone depending on the context.

> UTF-8 strings can have zero bytes in them

This is not true. A zero byte in a utf-8 string is the null terminator, and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

What you do need to look out for is malformed utf-8: for example, 1 byte before the null terminator you get a lead byte saying the next character is 4 bytes long.

If you're not checking each byte for null and are just skipping based on the length indicated by the lead byte, then you're in for a crash.
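A rough sketch of the "check each byte" approach, in C. The names utf8_lead_len and utf8_next are hypothetical, not from any library. It only checks structure (lead byte plus the right number of continuation bytes); it does not reject overlong encodings or surrogates, so treat it as an illustration of the bounds issue rather than a full validator.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by a lead byte, or 0 if the byte cannot start
 * a sequence (a stray continuation byte or an invalid lead). */
static size_t utf8_lead_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Returns a pointer to the start of the next sequence, or NULL if the
 * sequence at p is malformed or truncated. The null terminator is not a
 * continuation byte, so the loop stops at it instead of stepping over it. */
const char *utf8_next(const char *p)
{
    assert(p != NULL && *p != '\0');
    size_t len = utf8_lead_len((unsigned char)*p);
    if (len == 0)
        return NULL;
    for (size_t i = 1; i < len; i++) {
        if (((unsigned char)p[i] & 0xC0) != 0x80)
            return NULL;  /* also catches '\0' inside a sequence */
    }
    return p + len;
}
```

With a 4-byte lead byte sitting right before the terminator, utf8_next reads the terminator, sees it is not a continuation byte, and returns NULL without touching anything past the end of the string.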

Where utf-8 strings differ from C strings is slicing. You can't just slice the string at some random point without doing extra validation to make sure you only slice on codepoint boundaries.
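One way to do that extra check when slicing, sketched in C. The function name utf8_floor_boundary is made up, and it assumes the buffer holds well-formed UTF-8 of known length. It moves a proposed cut point backwards until it no longer lands on a continuation byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a byte index into s (length len), return the
 * nearest index at or before idx that is a code point boundary.
 * Index len (end of string) always counts as a boundary. */
size_t utf8_floor_boundary(const char *s, size_t len, size_t idx)
{
    assert(s != NULL && idx <= len);
    /* Continuation bytes look like 10xxxxxx; back up past them. */
    while (idx > 0 && idx < len && ((unsigned char)s[idx] & 0xC0) == 0x80)
        idx--;
    return idx;
}
```

For the 5-byte string "a€z" (61 E2 82 AC 7A), a cut requested at byte 2 or 3 gets moved back to byte 1, so the slice never splits the euro sign. This only protects code points; a cut can still separate a base letter from its combining marks.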

> A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

No, the parent was correct: UTF-8 encodes NUL (i.e. \0) as a single zero byte. In contrast, Modified UTF-8[1] uses the overlong two-byte encoding 0xC0 0x80 for NUL, so there's never any possibility of an internal zero byte. Of course, an application/library can choose to restrict itself to only handling UTF-8 that doesn't contain internal NULs, but the spec itself allows for zero bytes in a string.

[1]: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

We are in agreement that the only time a single zero byte can be found in well-formed utf-8 is for the NUL character.

By definition, with a null-terminated string, NUL is the terminator.

If you want to have strings that contain NUL, then by definition you can't use a null-terminated string.

This is true of utf-8 or regular C strings.

Unless you have U+0000, there isn't any other sequence of code points that has a 0x00 byte in UTF-8. I don't see this as a huge problem.

If you really do need it there are some C language libraries that use "pascal-ish" structs to do strings. UNICODE_STRING in Windows comes to mind. Doing strings in C doesn't force you to use C strings, it's just the most common thing to do.
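A minimal sketch of the "pascal-ish" idea in C99. The struct counted_str, the macro COUNTED_LITERAL, and counted_str_equal are all invented for this example; this is not the layout of UNICODE_STRING or of any real library. The point is only that the length travels with the pointer, so an embedded NUL is just another byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: length is stored, not discovered by
 * scanning for a terminator. */
struct counted_str {
    const char *data;
    size_t len;
};

/* Build from a string literal, keeping embedded NULs and dropping only
 * the implicit final terminator. Uses a C99 compound literal. */
#define COUNTED_LITERAL(lit) ((struct counted_str){ (lit), sizeof(lit) - 1 })

/* Byte-wise equality over the full stored length. */
int counted_str_equal(struct counted_str a, struct counted_str b)
{
    assert(a.data != NULL && b.data != NULL);
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

For the literals "ab\0cd" and "ab\0ce", strcmp stops at the embedded NUL and reports them equal, while counted_str_equal compares all 5 bytes and reports them different.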

No it's not a huge problem, but if you're not aware of it, it could easily lead to a security breach: https://news.ycombinator.com/item?id=13974919
It's the same as for ASCII: a zero byte in UTF-8 is NUL.
What are combinators?

Go parse some zalgo with your 4 per iteration algorithm. I'll be there, waiting and laughing.

C string handling is not elegant, nor does it fit the realities of the world.