| > Parsing this out of utf-8 encoding requires no knowledge of unicode or even utf-8. If you already have valid UTF-8, then yes, the task is a lot easier. But depending on the level at which you're parsing, this might not be the case: if you're writing a JSON parser from the ground up, you do need to know what UTF-8 and Unicode are, and you will need to validate the input data. > Converting the unicode character escape codes to utf-8 would require knowledge of utf-8 encoding Agreed. Even if you're not working at the "array-of-bytes" level, you will need to be able to parse "\u..."-style escapes and translate them into the appropriate output character encoding. > but this unescaping is not a feature that would be provided by the language regardless. I'm not sure we're talking about this being handled at the language level. This translation is something that would likely be offered at the parser level (working with the features offered by the standard library), but the parser does need to know about it, and it does need to be able to work with strings at a granular level to parse it out. By definition, it cannot leave the input data as an undecoded bag of bytes. Note, too, that the JSON spec does not specifically require UTF-8. UTF-16 is a completely valid encoding for JSON (though much less common than UTF-8), in which case none of these characters is encoded as a single ASCII byte, and greater awareness is needed to handle this. |
But all it's doing here is taking a hex string (which is entirely ASCII) and converting it into the corresponding numeric value. Since ASCII maps unambiguously to bytes, it doesn't really matter whether `str[0]` operates on a byte stream, a codepoint stream, or a grapheme stream: in UTF-8 they're all the same thing as long as we stay within the ASCII range.
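To make that concrete, here is a rough Python sketch of reading the four hex digits of a `\uXXXX` escape straight off the raw bytes. The name `parse_hex4` and the sample input are made up for illustration, not taken from any particular JSON parser.

```python
def parse_hex4(data: bytes, pos: int) -> int:
    """Return the numeric value of the 4 ASCII hex digits at data[pos:pos+4].

    Hypothetical helper: it only ever compares single bytes against ASCII
    ranges, so it needs no knowledge of UTF-8 or Unicode.
    """
    chunk = data[pos:pos + 4]
    if len(chunk) != 4:
        raise ValueError("truncated \\u escape")
    value = 0
    for b in chunk:
        if 0x30 <= b <= 0x39:        # '0'-'9'
            digit = b - 0x30
        elif 0x41 <= b <= 0x46:      # 'A'-'F'
            digit = b - 0x41 + 10
        elif 0x61 <= b <= 0x66:      # 'a'-'f'
            digit = b - 0x61 + 10
        else:
            raise ValueError(f"invalid hex digit byte: {b:#04x}")
        value = value * 16 + digit
    return value


# The input starts with a multi-byte character, but the escape is pure ASCII.
raw = "é = \\u00e9".encode("utf-8")
start = raw.index(b"\\u") + 2        # first hex digit after the backslash-u
codepoint = parse_hex4(raw, start)
```

Here `codepoint` comes out as `0xE9`. Turning that number back into output bytes is the step that does need to know the target encoding.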
Where things get hairy is stuff like `str.reverse()` over arbitrary strings that may or may not be ASCII. This repo[0] talks about some of the challenges associated with conflating characters with either bytes or codepoints. The problem is that programming languages often approach strings from the wrong angle: you can't just tack handling of multi-byte codepoints onto ASCII handling. You lose O(1) random access, and you still don't model the linguistic domain properly, because humans think of characters not in terms of bytes or codepoints but in terms of grapheme clusters. Clustering correctness falls deep in the realm of linguistics, and is therefore arguably better handled by a library than by a programming language.
[0] https://github.com/mathiasbynens/esrever
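A small Python sketch of the reversal problem, for what it's worth. The function names are invented for this example, and `reverse_keeping_marks` is only a crude approximation that glues combining marks onto their base character. It is not real grapheme cluster segmentation, which would need a dedicated library.

```python
import unicodedata


def naive_reverse(s: str) -> str:
    # Reverses codepoint by codepoint, so combining marks come loose.
    return s[::-1]


def reverse_keeping_marks(s: str) -> str:
    # Crude approximation: attach each combining mark to the character
    # before it, then reverse the resulting chunks.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# "mañana" spelled with a combining tilde (U+0303) after the first 'n'.
word = "man\u0303ana"

n_bytes = len(word.encode("utf-8"))      # 8 bytes
n_codepoints = len(word)                 # 7 codepoints, 6 perceived characters

broken = naive_reverse(word)             # tilde now follows an 'a'
better = reverse_keeping_marks(word)     # tilde stays on the 'n'
```

Three different counts for one six-letter word, and the "obvious" reversal moves the tilde onto the wrong letter.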