by deathanatos 3715 days ago
I can't decide if "JSON-superset" is technically accurate or not.

JSON's string literals come from JavaScript, and JavaScript only sort of has a Unicode string type. So the \u escape in both languages encodes a UTF-16 code unit, not a code point. That means in JSON, the single code point U+1F4A9 "Pile of Poo" is encoded like this:

    "\ud83d\udca9"
The JSON spec says this explicitly, too:

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".

   [… snip …]

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".
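
The rule quoted above can be sketched in a few lines of Python. This is only an illustration: `json_escape` is a hypothetical helper written for this comment, not part of any JSON library, and it assumes its input is a valid code point.

```python
import json

def json_escape(code_point: int) -> str:
    """Return the JSON \\u escape for one Unicode code point (a sketch)."""
    if code_point <= 0xFFFF:
        # BMP: a single six-character escape.
        return f"\\u{code_point:04x}"
    # Outside the BMP: split into a UTF-16 surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return f"\\u{high:04x}\\u{low:04x}"

print(json_escape(0x1F4A9))  # \ud83d\udca9
print(json_escape(0x1D11E))  # \ud834\udd1e

# Python's json module joins the pair back into one code point.
decoded = json.loads('"\\ud83d\\udca9"')
print(len(decoded))  # 1
```

The hex digits come out lowercase here; the spec allows either case.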
Now, Ion's spec says only:

   U+HHHH	\uHHHH	4-digit hexadecimal Unicode code point
But if we take it to mean code point, then if the value is a surrogate… what should happen?

Looking at the code, it looks like the above JSON will parse:

  1. Main parsing of \u here:
     https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L2429-L2434

  2. which is called from here, and just appended to a StringBuilder:
     https://github.com/amznlabs/ion-java/blob/1ca3cbe249848517fc6d91394bb493383d69eb61/src/software/amazon/ion/impl/IonReaderTextRawTokensX.java#L1975
My Java isn't that great, though, so I'm speculating, and I'm still not sure what should happen.

This is just one of those things that the first time I saw it in JSON/JS… a part of my brain melted. This is all a technicality, of course, and most JSON values should work just fine.

1 comment

> But if we take it to mean code point, then if the value is a surrogate… what should happen?

Surrogates are code points. The spec does not say what should happen if the surrogate is invalid (for example, if only the first surrogate of a surrogate pair is present), but neither does the JSON spec.
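
As one data point on the unpaired case, here is what CPython's standard `json` module does with a lone high surrogate. This shows the behavior of one parser, not something either spec requires, and other parsers may differ.

```python
import json

# Only the first half of a surrogate pair is present.
lone = json.loads('"\\ud83d"')
print(len(lone), hex(ord(lone)))  # 1 0xd83d

# The parse succeeds, but the resulting string is not valid Unicode text:
# it cannot be encoded as UTF-8.
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```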

Java internally also represents non-BMP code points using surrogates. So, simply appending the surrogates to the string should yield a valid Java string if the surrogates in the input are valid.