by Spex_guy
1645 days ago
Parsing this out of utf-8 encoding requires no knowledge of unicode or even utf-8. All of the relevant characters (reverse solidus, quotation mark, and control characters) are single byte characters in the ascii subset. These characters cannot be found inside multi-byte characters in utf-8 due to the design of the encoding. Converting the unicode character escape codes to utf-8 would require knowledge of utf-8 encoding, but this unescaping is not a feature that would be provided by the language regardless.
If you have valid UTF-8 already, then yes, the task is a lot easier. But depending on the level at which you're parsing, this might not be the case — i.e., if you're writing a JSON parser from the ground up, you do need to know what UTF-8 and Unicode are, and will need to validate the input data.
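To make the validation step concrete, here is a minimal sketch in Python. It leans on the built-in strict UTF-8 decoder rather than a hand-written state machine, and `is_valid_utf8` is a made-up helper name for illustration, not something from a particular JSON library.

```python
def is_valid_utf8(data: bytes) -> bool:
    # Hypothetical helper: Python's strict decoder rejects truncated
    # sequences, overlong encodings, and encoded surrogates.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b'{"k": "caf\xc3\xa9"}'))  # well-formed two-byte sequence
print(is_valid_utf8(b'{"k": "caf\xc3"}'))      # lead byte with no continuation byte
print(is_valid_utf8(b'"\xc0\xaf"'))            # overlong encoding of "/"
print(is_valid_utf8(b'"\xed\xa0\x80"'))        # UTF-8-encoded surrogate U+D800
```

A from-scratch parser would have to do the equivalent checks itself, which is where the Unicode knowledge comes in.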
> Converting the unicode character escape codes to utf-8 would require knowledge of utf-8 encoding
Agreed. Even if you're not working at the "array-of-bytes" level, you will need to be able to parse and translate "\u..."-style strings into the appropriate output character encoding.
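As a rough illustration of what that translation involves, here is a sketch in Python. The names `unescape_u` and `_read_unit` are hypothetical, and it only handles `\uXXXX` escapes (including surrogate pairs); every other escape is copied through unchanged, which a real parser would not do.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _read_unit(s, i):
    # Parse the four hex digits of a \uXXXX escape whose backslash is at index i.
    digits = s[i + 2:i + 6]
    if len(digits) != 4 or not set(digits) <= HEX_DIGITS:
        raise ValueError("malformed \\u escape at index %d" % i)
    return int(digits, 16)


def unescape_u(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] != "\\":
            out.append(s[i])
            i += 1
        elif s.startswith("\\u", i):
            unit = _read_unit(s, i)
            i += 6
            # A high surrogate followed by a low surrogate encodes one
            # code point above U+FFFF, so the two escapes are combined.
            if 0xD800 <= unit <= 0xDBFF and s.startswith("\\u", i):
                low = _read_unit(s, i)
                if 0xDC00 <= low <= 0xDFFF:
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            # A lone surrogate is kept as-is here; encoding it to UTF-8
            # later would raise UnicodeEncodeError.
            out.append(chr(unit))
        else:
            # Any other escape (including an escaped backslash) is left alone.
            out.append(s[i:i + 2])
            i += 2
    return "".join(out)


print(unescape_u(r"caf\u00e9").encode("utf-8"))
print(unescape_u(r"\ud83d\ude00").encode("utf-8"))
```

The surrogate-pair arithmetic is UTF-16 knowledge, and the final `.encode("utf-8")` is UTF-8 knowledge, so the parser needs both even though the input was plain ASCII.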
> but this unescaping is not a feature that would be provided by the language regardless.
I'm not sure we're talking about this being handled at the language level. This translation is something that would likely be offered at the parser level (working with the features offered by the standard library), but the parser does need to know about it — and does need to be able to work with strings at a granular level to be able to parse it out. By definition, it cannot leave the input data as an undecoded bag of bytes.
Note, too, that the JSON spec does not specifically require UTF-8. UTF-16 is a completely valid encoding for JSON (though much less common than UTF-8), in which case none of these characters are single ASCII bytes: each one is a two-byte code unit, and a byte value such as 0x22 can also turn up inside an unrelated character. Greater awareness is needed to handle this.
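Here is a small Python sketch of that difference. The function names are made up for illustration, and the sample string is chosen so that U+0122 happens to contain the byte 0x22 in its UTF-16 form.

```python
QUOTE = 0x22


def quote_byte_offsets(data):
    # Naive scan: every offset where the raw byte 0x22 occurs.
    return [i for i, b in enumerate(data) if b == QUOTE]


def quote_unit_offsets_utf16le(data):
    # Scan 16-bit little-endian code units instead of single bytes.
    return [
        i
        for i in range(0, len(data) - 1, 2)
        if int.from_bytes(data[i:i + 2], "little") == QUOTE
    ]


text = '"\u0122"'                  # a quoted string containing U+0122
utf8 = text.encode("utf-8")        # 22 C4 A2 22
utf16 = text.encode("utf-16-le")   # 22 00 | 22 01 | 22 00

print(quote_byte_offsets(utf8))           # only the two real quotes
print(quote_byte_offsets(utf16))          # includes a hit inside U+0122
print(quote_unit_offsets_utf16le(utf16))  # only the two real quotes
```

In UTF-8 every byte of a multi-byte character is 0x80 or above, so the byte scan is safe. In UTF-16 the scan has to work on whole code units, which means the parser must know the encoding and its byte order before it can even find the end of a string.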