	by asveikau 3375 days ago
	Funny. You can parse UTF-8 a character at a time. Some characters advance the pointer by 4 in an iteration and some by less.
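
A minimal sketch of that kind of loop in C, assuming the input is NUL-terminated and already known to be well-formed UTF-8. The names `utf8_seq_len` and `utf8_count_trusting` are made up for illustration; the lengths follow from the lead byte ranges of the UTF-8 encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with `lead`, or 0 if `lead` cannot start a sequence (a continuation
 * byte 0x80..0xBF, or an invalid lead such as 0xC0, 0xC1, 0xF5..0xFF). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                  /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2; /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3; /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4; /* 11110xxx */
    return 0;
}

/* Count code points by advancing the pointer 1 to 4 bytes per iteration.
 * ASSUMPTION: `s` is NUL-terminated and well-formed UTF-8. Continuation
 * bytes are never inspected. Returns (size_t)-1 on an invalid lead byte. */
static size_t utf8_count_trusting(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != '\0') {
        size_t len = utf8_seq_len((unsigned char)*s);
        if (len == 0) return (size_t)-1;
        s += len;
        n++;
    }
    return n;
}
```

Since the loop trusts the lead byte and never looks at the bytes it skips, it is only safe on input that was validated somewhere else.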

3 comments

imron 3374 days ago

You can, and then you get a lead byte announcing a 4-byte character 1 byte before the end of your data, you skip over the null terminator and into the stack, and bang.

Yes, you can avoid this if you're careful and you understand the intricacies of utf-8 (or some other multi-byte encoding), but it very quickly stops being elegant.
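
One way to be careful, sketched in C under the assumption that the string is NUL-terminated: check every continuation byte before stepping over it. Because 0x00 is not a continuation byte, the same check also stops at the terminator. `utf8_next_checked` is a hypothetical name, and this only checks structure; it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: step past one UTF-8 sequence in a NUL-terminated
 * string. Returns a pointer to the next sequence, `s` itself when already
 * at the terminator, or NULL if the sequence is malformed or truncated. */
static const char *utf8_next_checked(const char *s)
{
    assert(s != NULL);
    unsigned char lead = (unsigned char)*s;
    size_t len;

    if (lead == 0x00) return s;               /* at the terminator: stay put */
    else if (lead < 0x80) len = 1;            /* ASCII */
    else if ((lead & 0xE0) == 0xC0) len = 2;  /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) len = 3;  /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) len = 4;  /* 11110xxx */
    else return NULL;                         /* stray continuation or bad lead */

    for (size_t i = 1; i < len; i++) {
        /* Bytes are examined in order and we bail out on the first one
         * that is not 10xxxxxx. A NUL fails this test, so we never read
         * past the terminator. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            return NULL;
    }
    return s + len;
}
```

With a string that ends in a lone 0xF0, the call returns NULL instead of jumping four bytes ahead. The cost is the extra branch per byte, which is where the elegance goes.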

dbaupp 3375 days ago

What do you mean by "character"? If you mean code point or "Unicode scalar value", sure, but if you mean user-visible character (grapheme), it's much more complicated: even something "simple" like ö could be one or two code points.
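
To make the ö case concrete, here is a small C sketch. The two byte strings are the UTF-8 encodings of precomposed U+00F6 and of "o" followed by combining diaeresis U+0308; the function name `utf8_code_points` and the array names are made up, and the counter assumes well-formed input.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * ASSUMPTION: the input is well-formed UTF-8. */
static size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Both of these display as a single "ö". */
static const char O_UMLAUT_PRECOMPOSED[] = "\xC3\xB6";  /* U+00F6 */
static const char O_UMLAUT_DECOMPOSED[]  = "o\xCC\x88"; /* U+006F U+0308 */
```

`utf8_code_points` gives 1 for the first string and 2 for the second, even though a user sees the same single character. Counting graphemes needs the Unicode segmentation rules on top of this, which plain pointer arithmetic does not give you.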

asveikau 3375 days ago

I mean your iterator is char* and you advance it by adding. That's it.

I do NOT mean that a char itself corresponds to a glyph or code point. You are seriously preaching to the choir by giving me that lecture.

fauigerzigerk 3375 days ago

>you advance it by adding

And when do you stop? UTF-8 strings can have zero bytes in them, so treating them as C strings is potentially error-prone depending on the context.

imron 3374 days ago

> UTF-8 strings can have zero bytes in them

This is not true. A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

What you do need to look out for is malformed utf-8: for example, 1 byte before the null terminator you get a lead byte saying the next character is 4 bytes long.

If you're not checking each byte for null and just skipping based on the length indicated by the lead byte then you're in for a crash.

Where utf-8 strings differ from C strings is slicing. You can't just slice the string at some random point without doing extra validation to make sure you only slice on codepoint boundaries.
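
For the slicing case, a minimal C sketch of the extra step: before cutting at a byte index, back up until the index no longer points at a continuation byte. `utf8_floor_boundary` is a hypothetical name, and the function assumes the buffer is well-formed UTF-8 of known length.

```c
#include <assert.h>
#include <stddef.h>

/* Move `idx` back to the nearest code point boundary at or before it.
 * ASSUMPTION: s[0..len) is well-formed UTF-8. An index at or past the
 * end is clamped to `len`, which is always a valid cut point. */
static size_t utf8_floor_boundary(const char *s, size_t len, size_t idx)
{
    assert(s != NULL);
    if (idx >= len) return len;
    /* Continuation bytes look like 10xxxxxx; skip back over them. */
    while (idx > 0 && ((unsigned char)s[idx] & 0xC0) == 0x80)
        idx--;
    return idx;
}
```

For the 5-byte string "a€z" (0x61, 0xE2 0x82 0xAC, 0x7A), asking for index 2 or 3 returns 1, so the cut lands before the euro sign rather than inside it. This only guarantees a code point boundary; it can still split a base letter from a combining mark.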

dbaupp 3374 days ago

> A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

No, the parent was correct: UTF-8 encodes NUL (i.e. \0) as a single zero byte (in contrast, Modified UTF-8[1] uses an overlong encoding for NUL, so there's never any possibility of an internal zero byte). Of course, an application/library can choose to restrict itself to only handling UTF-8 that doesn't contain internal NULs, but the spec itself allows for zero bytes in a string.
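
A rough C sketch of the Modified UTF-8 trick for NUL only: each 0x00 byte is written as the overlong pair 0xC0 0x80, so the result can be NUL-terminated without ambiguity. `escape_nul_bytes` is a made-up name, and the sketch ignores the other Modified UTF-8 rule (supplementary characters encoded as surrogate pairs).

```c
#include <assert.h>
#include <stddef.h>

/* Copy src[0..src_len) to dst, writing each 0x00 byte as 0xC0 0x80 (the
 * Modified UTF-8 form of U+0000), then append a real 0x00 terminator.
 * Returns the number of bytes written, not counting the terminator, or
 * (size_t)-1 if dst_cap is too small. Handles only the NUL rule. */
static size_t escape_nul_bytes(const unsigned char *src, size_t src_len,
                               unsigned char *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < src_len; i++) {
        size_t need = (src[i] == 0x00) ? 2 : 1;
        /* Always keep one byte free for the terminator. */
        if (out + need >= dst_cap) return (size_t)-1;
        if (src[i] == 0x00) {
            dst[out++] = 0xC0;
            dst[out++] = 0x80;
        } else {
            dst[out++] = src[i];
        }
    }
    if (out >= dst_cap) return (size_t)-1;
    dst[out] = 0x00;
    return out;
}
```

The 3-byte input 'a', 0x00, 'b' becomes the 4 bytes 'a', 0xC0, 0x80, 'b' plus a terminator. A strict UTF-8 decoder is expected to reject 0xC0 0x80 as an overlong form, so this is a separate encoding and not valid standard UTF-8.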

[1]: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

imron 3374 days ago

We are in agreement that the only time a single zero byte can be found in well-formed utf-8 is for the NUL character.

By definition, with a null-terminated string, NUL is the terminator.

If you want to have strings that contain NUL, then by definition you can't use a null-terminated string.

This is true of utf-8 or regular C strings.

asveikau 3374 days ago

Unless you have U+0000, there isn't any other sequence of code points that has a 0x00 byte in UTF-8. I don't see this as a huge problem.

If you really do need it there are some C language libraries that use "pascal-ish" structs to do strings. UNICODE_STRING in Windows comes to mind. Doing strings in C doesn't force you to use C strings, it's just the most common thing to do.
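
A minimal sketch of the "pascal-ish" idea in C. The `byte_str` struct and `byte_str_eq` are hypothetical and are not the layout of Windows `UNICODE_STRING`; the point is only that the length travels with the pointer, so a 0x00 byte is ordinary data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: explicit byte length plus a pointer.
 * The data may contain 0x00 bytes and need not be NUL-terminated. */
struct byte_str {
    size_t len;
    const char *data;
};

/* Compare two counted strings over their full length. Unlike strcmp,
 * this does not stop at the first 0x00 byte. */
static int byte_str_eq(struct byte_str a, struct byte_str b)
{
    assert(a.data != NULL && b.data != NULL);
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Given the 5-byte buffers "ab\0cd" and "ab\0xy", `strcmp` reports them equal because it stops at the embedded zero, while `byte_str_eq` with `len` set to 5 reports them different.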

fauigerzigerk 3374 days ago

No, it's not a huge problem, but if you're not aware of it, it could easily lead to a security breach: https://news.ycombinator.com/item?id=13974919

MichaelGG 3374 days ago

It's the same for ASCII - UTF-8 zero byte is NUL.

pikzen 3373 days ago

What are combining characters?

Go parse some zalgo with your 4-per-iteration algorithm. I'll be there, waiting and laughing.

C string handling is not elegant, nor does it fit the realities of the world.