by debugnik 957 days ago
Or take the SQLite route, if you don't need to reject input that is invalid under strict JSON5. When they extended SQLite's JSON parser to support JSON5, they specifically relaxed the definition of unquoted keys (in a compatible way) to avoid Unicode tables[1]:

> Strict JSON5 requires that unquoted object keys must be ECMAScript 5.1 IdentifierNames. But large unicode tables and lots of code is required in order to determine whether or not a key is an ECMAScript 5.1 IdentifierName. For this reason, SQLite allows object keys to include any unicode characters greater than U+007f that are not whitespace characters. This relaxed definition of "identifier" greatly simplifies the implementation and allows the JSON parser to be smaller and run faster.

[1]: https://www.sqlite.org/json1.html
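To make the relaxed rule concrete, here is a rough Python sketch of a key check along the lines of the quoted paragraph. It is not SQLite's actual code: the function name `is_relaxed_key` is made up, the ASCII rules (letters, `_`, `$`, digits after the first character) are my reading of ECMAScript IdentifierName without `\u` escapes, and `str.isspace()` is only a stand-in for whatever whitespace set SQLite really uses.

```python
def is_relaxed_key(key: str) -> bool:
    """Hypothetical check for an unquoted object key under the relaxed rule.

    Assumptions (not taken from SQLite's source):
    - ASCII part: letters, '_' and '$' anywhere, digits after the first char.
    - Non-ASCII part: anything above U+007F that str.isspace() rejects.
    """
    if not key:
        return False
    for i, ch in enumerate(key):
        if ord(ch) > 0x7F:
            # No Unicode identifier tables needed: only whitespace is refused.
            if ch.isspace():
                return False
        elif ch.isalpha() or ch in "_$":
            # ASCII letter, underscore, or dollar sign.
            continue
        elif ch.isdigit() and i > 0:
            # ASCII digit, allowed anywhere except the first position.
            continue
        else:
            return False
    return True
```

For example, `is_relaxed_key("日本語")` is accepted without consulting any identifier table, while a key containing a no-break space (U+00A0) is rejected.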

1 comment

Or take the XML route. Unicode offers several strategies to avoid continuously updating the Unicode tables [1], one of which has been adopted by XML 1.1 and by later editions of XML 1.0. The actual syntax can be as simple as the following (based on XML, converted to ABNF as defined in RFC 5234):

    name = name-start-char *name-char
    name-start-char =
        %x41-5A / %x5F / %x61-7A / %xC0-D6 / %xD8-F6 / %xF8-2FF / %x370-37D /
        %x37F-1FFF / %x200C-200D / %x2070-218F / %x2C00-2FEF / %x3001-D7FF /
        %xF900-FDCF / %xFDF0-FFFD / %x10000-EFFFF
    name-char = name-start-char / %x30-39 / %xB7 / %x300-36F / %x203F-2040

I believe this approach is effective enough that every programming language willing to support Unicode identifiers but nothing more complex (e.g. case folding, normalization, or confusable detection) should use this XML-based syntax. You don't even need to narrow it down, because Unicode has explicitly avoided putting identifier characters outside those ranges due to the very existence of XML identifiers!
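As a rough illustration of how little code the ABNF above needs, here is a minimal Python sketch. The range tables are copied from the grammar; the names `is_name`, `is_name_start_char`, and `is_name_char` are just illustrative and not from any library.

```python
# Code point ranges copied from the name-start-char rule above.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Extra ranges that name-char allows on top of name-start-char.
NAME_EXTRA_RANGES = [
    (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp: int, ranges: list[tuple[int, int]]) -> bool:
    """Return True if code point cp falls in any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch: str) -> bool:
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_name(s: str) -> bool:
    """name = name-start-char *name-char"""
    return (
        bool(s)
        and is_name_start_char(s[0])
        and all(is_name_char(ch) for ch in s[1:])
    )
```

A linear scan over 19 ranges is plenty for a sketch; a real lexer would probably special-case ASCII first and binary-search the rest.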

[1] https://unicode.org/reports/tr31/#Immutable_Identifier_Synta...