by alarge 2031 days ago
The issue isn't whether or not your character encoding is always a multiple of 8 bits. It is whether or not you can use standard (octet-focused) parsing functions to deal with those strings. This is what makes UTF-8 "special". No byte of a UTF-8 multibyte sequence will ever have a value < 128, so no such byte can be mistaken for an ASCII character. For most "syntactic" parsing problems, you can therefore use standard C functions to deal with UTF-8 strings, something that is not true of most other multibyte character encodings.
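A minimal sketch of what that looks like in practice, assuming a NUL-terminated, valid UTF-8 input and an ASCII delimiter. The helper name `count_fields` is made up for illustration; only `strchr` comes from the standard library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts delimiter-separated fields in a
   NUL-terminated UTF-8 string. delim is assumed to be a non-NUL
   ASCII character (value < 0x80). */
size_t count_fields(const char *s, char delim)
{
    size_t n = 1;
    const char *p = s;

    if (delim == '\0')
        return 1; /* the terminator is not a usable delimiter */

    /* strchr compares single bytes. Every byte of a multibyte UTF-8
       sequence is >= 0x80, so an ASCII delimiter can only match a
       real delimiter, never the inside of another character. */
    while ((p = strchr(p, delim)) != NULL) {
        n++;
        p++; /* step past the delimiter just found */
    }
    return n;
}
```

Calling `count_fields("na\xc3\xafve,caf\xc3\xa9", ',')` treats the string as two fields even though both contain multibyte characters.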
2 comments

UTF-8 has an even stronger guarantee: if a byte sequence at any position in a UTF-8 string matches the UTF-8 encoding of a Unicode code point, then that part of the string represents that code point. This means you can use not only standard C functions like strchr to find ASCII characters in UTF-8 strings, but also e.g. strstr to find UTF-8 substrings in UTF-8 strings.
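A rough sketch of the strstr case, assuming both strings are valid, NUL-terminated UTF-8. The names `utf8_find` and `is_continuation` are invented for this example; `strstr` is the only library call doing the work.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the first occurrence of needle
   in haystack, or -1 if it does not occur. Both arguments are assumed
   to be valid NUL-terminated UTF-8. */
ptrdiff_t utf8_find(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? hit - haystack : -1;
}

/* Continuation bytes of a multibyte sequence look like 10xxxxxx. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}
```

Searching "caf\xc3\xa9 au lait" for "\xc3\xa9" (the two-byte encoding of U+00E9) yields byte offset 3, and the byte at that offset is a lead byte rather than a continuation byte, so the match starts on a code point boundary. Note that the offset counts bytes, not code points.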
Bytes are bytes. We’re not debating whether it’s easier to write a UTF-8 decoder; I’m asserting that (almost?) any data can be represented as a sequence of bytes and UTF-8 is not special in that regard.