HN Mirror


by imron 3371 days ago

We agree that the only place a zero byte can appear in well-formed UTF-8 is as the encoding of the NUL character (U+0000).

By definition, with a null-terminated string, NUL is the terminator.

If you want to have strings that contain NUL, then by definition you can't use a null-terminated string.

This is true of UTF-8 and of regular C strings alike.
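To make the truncation concrete, here is a minimal sketch in C. The helper name `visible_length` is made up for illustration; it reports how many bytes of a counted buffer the C string functions would treat as the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes that NUL-terminated string
   functions would "see" in a buffer whose real length is len. */
size_t visible_length(const char *buf, size_t len)
{
    assert(buf != NULL);
    /* memchr stops at the first zero byte or after len bytes. */
    const char *nul = memchr(buf, '\0', len);
    return nul ? (size_t)(nul - buf) : len;
}
```

For the 5-byte buffer `"ab\0cd"`, both `visible_length` and `strlen` report 2; the bytes after the NUL are invisible to anything that treats the buffer as a C string.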


fauigerzigerk 3371 days ago

The point is, if you handle strings the C way, you're not in conformance with UTF-8.

If someone passes you a text file that is verified to be valid UTF-8 and contains, say, access permissions, then you had better not stop parsing it at the first '\0' character.

None of this is a huge problem, but it's something to be aware of. C string handling is incompatible with UTF-8.
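A rough sketch of that pitfall, assuming a made-up permissions buffer and a hypothetical length-aware helper `find_bytes` (standard C has no portable `memmem`, so the search is written out by hand):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: substring search over a counted buffer.
   Embedded zero bytes are treated as ordinary data. */
const char *find_bytes(const char *hay, size_t hay_len,
                       const char *needle, size_t needle_len)
{
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;
    for (size_t i = 0; i + needle_len <= hay_len; i++) {
        if (memcmp(hay + i, needle, needle_len) == 0)
            return hay + i;
    }
    return NULL;
}
```

With the 20-byte input `"allow alice\0deny bob"`, `strstr(policy, "deny")` returns NULL because it stops at the embedded NUL, while `find_bytes(policy, 20, "deny", 4)` finds the rule at offset 12.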


imron 3371 days ago

File processing and string processing are not the same. If you have a file that has a specific data format outside of the encoding, and that format includes NUL bytes as part of the data, then obviously process the file based on that format.

That's separate from string handling.

UTF-8 was originally designed to be compatible with NUL-terminated strings and to keep NULs out of well-formed text.

In fact, it was the first point in the 'Criteria for the Transformation Format' listed in the initial proposal for UTF-8.

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
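The property can be checked with a small encoder. This is a sketch of the modern 1-to-4-byte form of UTF-8, not the original proposal's wider format, and the function names are illustrative. Every lead byte of a multi-byte sequence has its high bits set and every continuation byte has the form 10xxxxxx, so a zero byte can only come from U+0000 itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 (modern 1-4 byte form).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond the Unicode range */
}

/* Hypothetical check: does the encoding of cp contain a zero byte? */
int encoding_has_zero_byte(uint32_t cp)
{
    unsigned char buf[4];
    size_t n = utf8_encode(cp, buf);
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == 0)
            return 1;
    }
    return 0;
}
```

`encoding_has_zero_byte(0)` is 1, and it is 0 for every other valid code point, including ones such as U+0100 or U+10000 whose UTF-16 or UTF-32 forms do contain zero bytes.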


fauigerzigerk 3371 days ago

> File processing and string processing are not the same

The UTF-8 spec doesn't make that distinction as far as I know. There's a simple fact: a valid UTF-8 byte sequence can contain NUL characters. So you can't naively use C string handling functions on it. And as someone else has correctly pointed out, the same is true for ASCII.

I'm just pointing out a potential pitfall and a source of security issues. Some might assume that after validating UTF-8 text input, you could just dump it in a C string and process it using C's string functions. But that's not the case.
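One way to guard that boundary, sketched under the assumption that UTF-8 validation has already happened elsewhere: refuse to convert a counted buffer into a C string if it contains an embedded NUL. The name `to_c_string` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy a counted buffer into a freshly allocated
   NUL-terminated string. Returns NULL if the input contains an embedded
   zero byte (or if allocation fails). The caller frees the result.
   UTF-8 validation is assumed to be done separately. */
char *to_c_string(const char *buf, size_t len)
{
    assert(buf != NULL);
    if (memchr(buf, '\0', len) != NULL)
        return NULL; /* would be silently truncated as a C string */
    char *s = malloc(len + 1);
    if (s == NULL)
        return NULL;
    memcpy(s, buf, len);
    s[len] = '\0';
    return s;
}
```

Valid input such as the 5-byte UTF-8 encoding of "café" converts normally, while `to_c_string("a\0b", 3)` returns NULL instead of quietly yielding "a". Whether to reject, escape, or keep using counted buffers is a design choice; rejection is just the simplest to show.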
