Hacker News new | ask | show | jobs
by lifthrasiir 599 days ago
Here the canonical representation refers to one single definite and unambiguous encoding for given data. This requirement is very common in cryptographic applications and also commonly demanded when the deterministic processing is desired. Technically the "canonical" and "deterministic" encoding can differ (e.g. ASN.1 CER vs. DER), but there is not much value to have two distinct encodings.

On graphs: as you've said recursive code has to be careful, but the recursion itself is not very frequent in normal applications and they are more susceptible to attacks if the recursion is built into the serialization format. XML billion laughs attack is a famous demonstration for this issue. It is still worthwhile to have parallel standards to specify how recursive structures should be encoded in the basic format, and possibly to tweak the basic format to better accommodate such standards, but I believe such needs can be met without making the basic format bigger.

1 comments

Is canonilization not irrelevant. Any format with comments is not canonical; so xᴍʟ is not, ᴊꜱᴏɴ has escapes options for string characters so it not either.

Response to graph conjecture stated https://news.ycombinator.com/item?id=42072133.

Almost no serialization format is canonical by default---AFAIK bencode was the sole example that mandates the canonicalization. Instead, a canonical subset of the format is usually defined, which would of course exclude comments, unless comments themselves are considered semantic like XML. Yes, even XML has a canonical subset [1]!

[1] https://www.w3.org/TR/xml-c14n/

There is the concept of “well formed xᴇɴᴏɴ” as outputted by the ᴅᴏᴍ and serializer which is deterministic as will suffice for canonicalization.