|
I can't decide if "JSON-superset" is technically accurate or not. JSON's string literals come from JavaScript, and JavaScript only sort of has a Unicode string type. So the \u escape in both languages encodes a UTF-16 code unit, not a code point. That means in JSON, the single code point U+1F4A9 "Pile of Poo" is encoded like this: "\ud83d\udca9"
JSON specifically says this, too:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
[… snip …]
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".
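The arithmetic behind the quoted passage can be sketched in a few lines of Java. This is only an illustration of the standard UTF-16 surrogate-pair calculation; the class and method names (`SurrogateEscape`, `escape`) are made up for this example and do not come from any JSON or Ion library.

```java
final class SurrogateEscape {
    /** Renders one code point as JSON-style escapes: one for the BMP, two otherwise. */
    static String escape(int codePoint) {
        if (codePoint < 0x10000) {
            // BMP: the four hex digits are the code point itself.
            return String.format("\\u%04X", codePoint);
        }
        int offset = codePoint - 0x10000;      // 20 significant bits remain
        int high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
        int low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
        return String.format("\\u%04X\\u%04X", high, low);
    }

    public static void main(String[] args) {
        System.out.println(escape(0x5C));     // reverse solidus, one escape
        System.out.println(escape(0x1D11E));  // G clef, two escapes
        System.out.println(escape(0x1F4A9));  // Pile of Poo, two escapes
    }
}
```

For U+1D11E this gives the same pair as the RFC's example (D834, DD1E), and for U+1F4A9 it gives D83D, DCA9.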
Now, Ion's spec says only:
U+HHHH \uHHHH 4-digit hexadecimal Unicode code point
But if we take it to mean code point, then if the value is a surrogate… what should happen?
Looking at the code, it looks like the above JSON will parse:
1. Main parsing of \u here:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434
2. which is called from here, and just appended to a StringBuilder:
https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975
My Java isn't that great though, so I'm speculating. But I'm not sure what should happen.
This is just one of those things that the first time I saw it in JSON/JS… a part of my brain melted. This is all a technicality, of course, and most JSON values should work just fine. |
Surrogates are code points. The spec does not say what should happen if the surrogate is invalid (for example, if only the first surrogate of a surrogate pair is present), but neither does the JSON spec.
Java internally also represents non-BMP code points using surrogates. So, simply appending the surrogates to the string should yield a valid Java string if the surrogates in the input are valid.
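A minimal sketch of that behaviour, using only standard `java.lang` calls. It is not the ion-java code path: `SurrogateAppend`, `fromUnits`, and `isWellFormed` are hypothetical names for this example, and the well-formedness check is one possible policy for lone surrogates, not something either spec requires.

```java
final class SurrogateAppend {
    /** Appends UTF-16 code units one at a time, roughly as an escape decoder might. */
    static String fromUnits(int... units) {
        StringBuilder sb = new StringBuilder();
        for (int unit : units) {
            sb.append((char) unit);
        }
        return sb.toString();
    }

    /** Returns true if every surrogate in s belongs to a high/low pair. */
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false; // high surrogate with no partner
                }
                i++; // skip the low surrogate we just validated
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String poo = fromUnits(0xD83D, 0xDCA9);
        System.out.println(poo.length());                             // 2 code units
        System.out.println(poo.codePointCount(0, poo.length()));      // 1 code point
        System.out.println(Integer.toHexString(poo.codePointAt(0)));  // 1f4a9
        System.out.println(isWellFormed(fromUnits(0xD83D)));          // false
    }
}
```

Appending a lone surrogate does not throw in Java; the resulting string simply is not well-formed UTF-16, so a reader that wants to reject such input has to check for it explicitly.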