by snissn 3581 days ago
That's cool about utf8 - what downsides are there to not treating utf-8 as raw bytes?

The big things are related to string length not matching byte count. A character-counting strlen() is O(n) because it has to walk the string to see how many multi-byte sequences are actually in it. More than that, splitting/slicing/indexing a string based on byte offsets doesn't work. For a 100-byte ASCII string, you're guaranteed that you can split it into two 50-byte strings and things will still work: you can output them separately, you can get the total length by adding strlen() on each half, you can find a character by doing strchr() on each half, etc. For a 100-byte valid UTF-8 string, splitting it into two 50-byte strings can give you two invalid strings, because a multi-byte character could be split in half. So strlen() (even a UTF-8-correct strlen()) and strchr() don't compose across the split. Outputting a string in two halves still works properly as long as the receiver buffers its input and is willing to wait to reconstruct a partial character.

A related problem is that in older UNIX terminals, pressing backspace would delete one byte, not one character. Newer UNIX kernels have code in the terminal implementation to decode UTF-8 enough to backspace an entire character.

To clarify, getting the length of a UTF-8 string in Go is O(1); it's computed and stored in the string header at creation.
To clarify even more: that length is the number of bytes (or UTF-8 code units) in the string. It doesn't correspond to the number of characters (which one may consider to be either Unicode codepoints or, more technically correct, Unicode grapheme clusters).

If you want to count the number of codepoints (called "runes" in Go) in a string, then you need to do so explicitly: https://golang.org/pkg/unicode/utf8/#RuneCountInString
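
A quick sketch of the difference, with a made-up helper name (byteAndRuneLen); len and utf8.RuneCountInString are the real standard-library pieces:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen (hypothetical helper) returns the byte length, which
// Go reads straight from the string header, and the codepoint count,
// which requires a scan over the whole string.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// "héllo, 世界": é is 2 bytes, each CJK character is 3 bytes.
	fmt.Println(byteAndRuneLen("h\u00e9llo, \u4e16\u754c")) // 14 9

	// 'e' followed by a combining acute accent: one visible character,
	// but two codepoints and three bytes.
	fmt.Println(byteAndRuneLen("e\u0301")) // 3 2
}
```

The second case is why counting runes still isn't counting grapheme clusters; the standard library has no function for that.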

Touché