The big problems come from string length no longer matching byte count. Counting characters is O(n) even when you already know the byte length, because you have to scan the string to see how many sequences are actually in it. More than that, splitting, slicing, or indexing a string by byte offset doesn't work. For a 100-byte ASCII string, you're guaranteed that you can split it into two 50-byte strings and things will still work: you can output them separately, you can get the total length by adding strlen() of each half, you can find a character by calling strchr() on each half, etc. For a 100-byte valid UTF-8 string, splitting it into two 50-byte strings may leave you with invalid strings, because a multi-byte character could be cut in half. So strlen() (even a UTF-8-correct strlen()) and strchr() don't compose across an arbitrary byte split. Outputting a string in two halves still works, but only as long as the receiver buffers its input and is willing to wait to reconstruct a partial character.
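To make the byte-count vs. character-count mismatch concrete, here is a minimal C sketch. The names utf8_strlen() and utf8_split_point() are hypothetical helpers, not standard library functions, and both assume the input is already valid, NUL-terminated UTF-8. The first counts characters by skipping continuation bytes (those of the form 10xxxxxx); the second backs a requested byte offset up until it no longer lands in the middle of a sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string.
   Assumes valid UTF-8. Every byte that is not a continuation byte
   (10xxxxxx) starts a new code point, so the whole string must be scanned. */
size_t utf8_strlen(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: return the largest byte offset <= want at which
   the string can be cut without splitting a multi-byte sequence.
   Assumes valid UTF-8; offsets past the end are clamped to the length. */
size_t utf8_split_point(const char *s, size_t want)
{
    assert(s != NULL);
    size_t len = strlen(s);
    if (want > len)
        want = len;
    /* Back up while the byte at the cut is a continuation byte. */
    while (want > 0 && want < len && ((unsigned char)s[want] & 0xC0) == 0x80)
        want--;
    return want;
}
```

For "h\xC3\xA9llo" ("héllo"), strlen() reports 6 bytes while utf8_strlen() reports 5 characters, and asking to split at byte 2 gets moved back to byte 1 so the two-byte "é" stays whole.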

A related problem is that on older UNIX terminals, pressing backspace deleted one byte, not one character. Newer UNIX kernels have code in the terminal implementation that decodes UTF-8 just enough to backspace over an entire character (on Linux, this is enabled by the IUTF8 termios flag).
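The "decode just enough" idea can be sketched in a few lines of C. This is not the kernel's actual code, only an illustration of the technique; erase_last_char() is a hypothetical name, and the buffer is assumed to hold valid UTF-8. It drops the final byte, then keeps dropping bytes while the last one removed was a continuation byte, so the lead byte goes too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given a line buffer of len bytes, return the new
   length after erasing the last character. Assumes valid UTF-8. */
size_t erase_last_char(const char *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0)
        return 0;
    len--; /* drop the final byte */
    /* If that byte was a continuation byte (10xxxxxx), keep going
       until the lead byte of the sequence has been dropped as well. */
    while (len > 0 && ((unsigned char)buf[len] & 0xC0) == 0x80)
        len--;
    return len;
}
```

A byte-at-a-time backspace would turn "a\xC3\xA9" into "a\xC3", leaving a dangling lead byte; the sketch above returns length 1, leaving just "a".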