by _rend 1645 days ago
> Parsing this out of utf-8 encoding requires no knowledge of unicode or even utf-8.

If you have valid UTF-8 already, then yes, the task is a lot easier. But depending on the level at which you're parsing, this might not be the case — i.e., if you're writing a JSON parser from the ground up, you do need to know what UTF-8 and Unicode are, and will need to validate the input data.

> Converting the unicode character escape codes to utf-8 would require knowledge of utf-8 encoding

Agreed. Even if you're not working at the "array-of-bytes" level, you will need to be able to parse and translate "\u..."-style strings into the appropriate output character encoding.
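To make that concrete, here is a rough Python sketch of translating a `\uXXXX` escape (including a surrogate pair) into UTF-8 bytes. The names `read_hex4` and `decode_escape` are made up for illustration, and rejecting unpaired surrogates is an assumption; some real parsers are more lenient.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def read_hex4(buf: bytes, i: int) -> int:
    """Parse the six ASCII bytes \\uXXXX at buf[i:] and return the 16-bit value."""
    digits = buf[i + 2:i + 6]
    if buf[i:i + 2] != b"\\u" or len(digits) != 4:
        raise ValueError("expected a \\uXXXX escape")
    if any(d not in HEX_DIGITS for d in digits):
        raise ValueError("non-hex digit in escape")
    return int(digits.decode("ascii"), 16)


def decode_escape(buf: bytes, i: int):
    """Return (utf8_bytes, next_index) for the escape starting at buf[i]."""
    cp = read_hex4(buf, i)
    i += 6
    if 0xD800 <= cp <= 0xDBFF:
        # High surrogate: a low surrogate escape must follow (assumed strict).
        low = read_hex4(buf, i)
        if not 0xDC00 <= low <= 0xDFFF:
            raise ValueError("unpaired high surrogate")
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00)
        i += 6
    elif 0xDC00 <= cp <= 0xDFFF:
        raise ValueError("unpaired low surrogate")
    # The output encoding is the part that needs UTF-8 knowledge;
    # here Python's codec does it for us.
    return chr(cp).encode("utf-8"), i
```

For example, `decode_escape(b"\\u00e9", 0)` returns `(b"\xc3\xa9", 6)`: the input is six ASCII bytes, the output is the two-byte UTF-8 encoding of U+00E9.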

> but this unescaping is not a feature that would be provided by the language regardless.

I'm not sure we're talking about this being handled at the language level. This translation is something that would likely be offered at the parser level (working with the features offered by the standard library), but the parser does need to know about it — and does need to be able to work with strings at a granular level to be able to parse it out. By definition, it cannot leave the input data as an undecoded bag of bytes.

Note, too, that the JSON spec does not specifically require UTF-8. UTF-16 is a completely valid encoding for JSON (though much less common than UTF-8), in which case none of these characters are an ASCII subset, and greater awareness is needed to be able to handle this.
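A quick Python illustration of the difference, assuming little-endian UTF-16 without a BOM: the same document has a very different byte layout, so a scanner that steps one byte at a time looking for `{` or `"` no longer lines up with the characters.

```python
import json

doc = '{"a":1}'
utf8 = doc.encode("utf-8")
utf16 = doc.encode("utf-16-le")  # no BOM, little-endian

# In UTF-8 every structural character is a single ASCII byte.
assert utf8 == b'{"a":1}'

# In UTF-16 each one is a two-byte code unit, so a byte-by-byte
# scanner sees a zero byte after every character.
assert utf16[:4] == b'{\x00"\x00'
assert len(utf16) == 2 * len(utf8)

# Python's json.loads accepts bytes in UTF-8, UTF-16 or UTF-32 and
# detects which one it was given.
assert json.loads(utf8) == json.loads(utf16) == {"a": 1}
```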


> it cannot leave the input data as an undecoded bag of bytes

But all it's doing here is taking a hex string (which is entirely ASCII) and converting it into the respective hex representation. Since ASCII translates unambiguously to bytes, it doesn't really matter if `str[0]` is operating on a byte stream, a codepoint stream, or a grapheme stream, because in UTF-8 they're all the same thing as long as we're within the ASCII range.
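A small Python check of that claim, as a sketch: for the ASCII-only escape text, index `i` picks out the same value whether you look at UTF-8 bytes or at code points, and the two views only diverge once a non-ASCII character shows up. The helper name `views_agree` is made up for illustration.

```python
def views_agree(s: str) -> bool:
    """True if the UTF-8 bytes of s and the code points of s line up index by index."""
    raw = s.encode("utf-8")
    return len(raw) == len(s) and all(raw[i] == ord(s[i]) for i in range(len(s)))


escape_text = "\\u00e9"   # six ASCII characters: \ u 0 0 e 9
decoded = "\u00e9"        # the single code point U+00E9

assert views_agree(escape_text)      # byte i == code point i
assert not views_agree(decoded)      # one code point, two UTF-8 bytes
assert len(decoded.encode("utf-8")) == 2
```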

Where things get hairy is stuff like `str.reverse()` over arbitrary strings that may or may not be in ASCII. This repo[0] talks about some of the challenges associated with conflating characters with either bytes or codepoints. The problem is that programming languages often approach strings from the wrong angle: you can't just tack handling of multi-byte codepoints on top of ASCII handling. You lose O(1) random access, and you don't actually model the linguistic domain properly by doing so, because humans think of characters not in terms of bytes or codepoints, but in terms of grapheme clusters. Clustering correctness falls deep in the realm of linguistics, and is therefore arguably more suitable to be handled by a library than a programming language.

[0] https://github.com/mathiasbynens/esrever
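To show the reversal problem in Python terms (a sketch, not how the linked repo does it): reversing by code point moves a combining mark onto the wrong letter. The helper `reverse_keeping_marks` is a made-up name and only keeps combining marks attached to their base character; that is a deliberate simplification and not full grapheme cluster segmentation.

```python
import unicodedata


def reverse_keeping_marks(s: str) -> str:
    """Reverse s while keeping combining marks attached to their base character.

    Simplified: this is not full grapheme cluster segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new cluster
    return "".join(reversed(clusters))


word = "man\u0303ana"               # "mañana" spelled as n + combining tilde

naive = word[::-1]                  # reverses code points
assert naive == "ana\u0303nam"      # the tilde now sits on an "a"

assert reverse_keeping_marks(word) == "anan\u0303am"   # tilde stays on the "n"
```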

I agree entirely with your second paragraph, but regarding this:

> hex string (which is entirely ASCII)

My point is that JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc.; so long as the text is composed of Unicode code points in some Unicode transformation format, the JSON is valid.

As I said in the parent comment: if you are working within UTF-8 exclusively, and can assume valid UTF-8, then great! But this isn't necessarily true, and in some cases, you will still need to care about the encoding.

(Either way, this starts straying slightly from the more general discussion at hand: regardless of the encoding of the string, you will still need an ergonomic way of interacting with the contents of the data in order to meaningfully parse the contents — even past the hurdle of decoding from arbitrary bytes, you still need to manipulate the data reasonably. In some cases, this means working with a buffer of bytes; in others, it makes sense to manipulate the data as a string... In which case, you may run into some of the string manipulation ergonomic considerations being discussed around these comments.)

> JSON doesn't need to be UTF-8 or a superset of ASCII to be valid. It can be any representation of Unicode, including UTF-16, UTF-32, GB 18030, etc

Sure, it can also be gzipped, encrypted, etc., but that goes back to the point that there's nothing inherently special about JSON as it relates to encoding to a byte stream. All there is to it is that somewhere in a program there's an encode/decode contract to extract meaning out of the byte stream, and in a protocol one most likely only looks at byte streams as sequences of bytes (because performance-wise, it doesn't make sense to look at payload size in terms of number of codepoints/graphemes at a protocol level).
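As a rough Python illustration of the payload-size point: the number a length header would carry is the byte count after encoding, which differs from the code point count as soon as the text leaves ASCII. The helper name `wire_size` is made up, and UTF-8 as the wire encoding is an assumption.

```python
import json


def wire_size(obj) -> int:
    """Size in bytes of obj serialized as UTF-8 JSON (hypothetical helper)."""
    return len(json.dumps(obj, ensure_ascii=False).encode("utf-8"))


message = {"name": "mañana"}
text = json.dumps(message, ensure_ascii=False)

# One non-ASCII character ("ñ") takes two bytes in UTF-8, so the byte
# count is one more than the code point count.
assert wire_size(message) == len(text) + 1
```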