by Fannon 598 days ago
Thanks for writing this up! Sounds like you really dug deep into this.

Not entirely sure what you mean by canonical representation (I've heard the term in the context of JSON-LD before, though). Can you explain what you mean here?

Where do you see the problem with graphs and DoS? A reference is just a pointer; you just have to be careful when writing recursive code. I actually like the idea of explicitly defining how a reference / association is made, because otherwise people have to reinvent ID and association concepts and there's no shared understanding. In JSON Schema, you cannot properly express an association or graph structure, so people start using overloaded and not well-defined concepts like `$ref`, which is a separate standard.
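
To make the "be careful with recursion" part concrete, here is a minimal sketch of following references without looping forever on a cycle. The `nodes` layout (an ID mapped to the list of IDs it references) and the function name are made up for illustration, not taken from any particular format:

```python
def reachable_ids(nodes, start):
    """Collect every ID reachable from `start`, visiting each node once.

    `nodes` is a hypothetical mapping: id -> list of referenced ids.
    """
    seen = set()
    stack = [start]
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue  # already visited: this is what stops cycles
        seen.add(node_id)
        stack.extend(nodes.get(node_id, []))
    return seen


# "a" and "b" reference each other, so a naive recursive walk would never end.
graph = {"a": ["b"], "b": ["a", "c"], "c": []}
result = reachable_ids(graph, "a")
```

The visited set is the whole trick: the cycle between "a" and "b" is followed once and then skipped.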


Here, canonical representation refers to a single, definite, and unambiguous encoding for any given data. This requirement is very common in cryptographic applications, where a hash or signature is computed over the encoded bytes, and is also commonly demanded when deterministic processing is desired. Technically, the "canonical" and "deterministic" encodings can differ (e.g. ASN.1 CER vs. DER), but there is not much value in having two distinct encodings.
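
A rough sketch of why this matters for hashing, using Python's standard `json` module. The helper name `canonical_json` is made up, and sorting keys plus stripping whitespace is only an approximation of a real canonicalization scheme (it does not address number formatting, for instance):

```python
import hashlib
import json


def canonical_json(value):
    """Illustrative only: sorted keys, no optional whitespace, UTF-8 output."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


# The same data, built with a different key order.
a = {"b": 1, "a": [1, 2]}
b = {"a": [1, 2], "b": 1}

# The default encoder preserves insertion order, so the two texts differ...
default_differs = json.dumps(a) != json.dumps(b)

# ...while the canonical form gives identical bytes and therefore identical digests.
digest_a = hashlib.sha256(canonical_json(a)).hexdigest()
digest_b = hashlib.sha256(canonical_json(b)).hexdigest()
```

Without the canonical step, two parties holding equal data could compute different digests and a signature check would fail for no semantic reason.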

On graphs: as you've said, recursive code has to be careful, but recursion itself is not very frequent in normal applications, and applications are more susceptible to attacks if recursion is built into the serialization format. The XML billion laughs attack is a famous demonstration of this issue. It is still worthwhile to have parallel standards that specify how recursive structures should be encoded in the basic format, and possibly to tweak the basic format to better accommodate such standards, but I believe such needs can be met without making the basic format bigger.
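
To put a number on billion laughs, here is a small model of nested entity expansion. It is not an XML parser; the entity table layout and the names `build_entities` and `expanded_length` are hypothetical. It only computes how long the expansion would be, without ever building the string:

```python
def build_entities(levels, fanout, payload):
    """Entity e0 is literal text; each e(i) is `fanout` references to e(i-1)."""
    entities = {"e0": [payload]}
    for i in range(1, levels + 1):
        entities[f"e{i}"] = [("ref", f"e{i - 1}")] * fanout
    return entities


def expanded_length(entities, name, cache=None):
    """Length of the fully expanded entity, memoized per entity name."""
    cache = {} if cache is None else cache
    if name not in cache:
        total = 0
        for part in entities[name]:
            if isinstance(part, tuple):  # ("ref", other_name)
                total += expanded_length(entities, part[1], cache)
            else:  # literal text
                total += len(part)
        cache[name] = total
    return cache[name]


# Shape of the classic payload: "lol", then nine levels of ten references each.
entities = build_entities(levels=9, fanout=10, payload="lol")
size = expanded_length(entities, "e9")
```

Ten short definitions describe three billion characters of output, which is why a parser that expands references eagerly needs an explicit limit.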

Isn't canonicalization irrelevant? Any format with comments is not canonical, so XML is not; JSON has escape options for string characters, so it is not either.
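
A quick illustration of the JSON escape point, using Python's standard `json` module (the variable names are just for the example): three different texts that all decode to the same one-character string.

```python
import json

# Three distinct JSON texts for the string "/": plain, short escape, \u escape.
texts = ['"/"', '"\\/"', '"\\u002f"']

decoded = [json.loads(t) for t in texts]

distinct_texts = len(set(texts))      # how many different encodings
distinct_values = len(set(decoded))   # how many different decoded values
```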

Response to the graph conjecture: https://news.ycombinator.com/item?id=42072133

Almost no serialization format is canonical by default; AFAIK bencode is the sole example that mandates canonicalization. Instead, a canonical subset of the format is usually defined, which would of course exclude comments, unless the comments themselves are considered semantic, as in XML. Yes, even XML has a canonical subset [1]!

[1] https://www.w3.org/TR/xml-c14n/
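
For the bencode case, a minimal encoder sketch shows where the canonical form comes from: integers have exactly one spelling, strings are length-prefixed with no escapes, and dictionary keys are emitted in sorted byte order. The function below is my own illustration, not a reference implementation; accepting `str` by encoding it as UTF-8 is an assumption for convenience.

```python
def bencode(value):
    """Encode ints, bytes/str, lists, and dicts as bencode (illustrative sketch)."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")  # assumption: treat text as UTF-8 bytes
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = []
        for key, item in value.items():
            key = key.encode("utf-8") if isinstance(key, str) else key
            if not isinstance(key, bytes):
                raise TypeError("bencode dictionary keys must be byte strings")
            items.append((key, item))
        items.sort(key=lambda pair: pair[0])  # sorted keys: one order per dict
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


first = bencode({"b": 1, "a": [b"x", 2]})
second = bencode({"a": [b"x", 2], "b": 1})
```

Both calls produce the same bytes regardless of the order in which the dictionary was built.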

There is the concept of “well-formed XENON” as output by the DOM and serializer, which is deterministic and will suffice for canonicalization.