I've yet to see a convincing use-case for canonicalizing RDF graphs.
The document cites these use-cases:
> There are different use cases where graph or dataset canonicalization are important:
> Determining if one serialization is isomorphic to another.
> Digital signing of graphs (datasets) independent of serialization or format.
> Comparing two graphs (datasets) to find differences.
> Communicating change sets when remotely updating an RDF source.
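To make the first cited use-case concrete, here is a minimal sketch of why byte comparison and graph isomorphism disagree. It assumes a tiny N-Triples-like subset (one `s p o .` per line, no spaces inside terms) and uses a naive brute-force matcher over blank-node labels. This is not the algorithm from the W3C document, just an illustration that only works for very small graphs; `parse`, `isomorphic`, and the `example.org` data are made up for the example.

```python
import hashlib
from itertools import permutations


def parse(text):
    """Parse a tiny N-Triples-like subset into a set of (s, p, o) string tuples."""
    triples = set()
    for line in text.strip().splitlines():
        s, p, o = line.rstrip(" .").split()
        triples.add((s, p, o))
    return triples


def blank_nodes(triples):
    """Return the sorted blank-node labels used in subject or object position."""
    return sorted({t for s, _, o in triples for t in (s, o) if t.startswith("_:")})


def isomorphic(g1, g2):
    """Brute force: try every mapping of g1's blank nodes onto g2's blank nodes."""
    b1, b2 = blank_nodes(g1), blank_nodes(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    for perm in permutations(b2):
        mapping = dict(zip(b1, perm))
        relabeled = {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in g1}
        if relabeled == g2:
            return True
    return False


# Same graph, different blank-node labels and different triple order.
doc_a = '_:a <http://example.org/knows> _:b .\n_:b <http://example.org/name> "Bob" .\n'
doc_b = '_:x <http://example.org/name> "Bob" .\n_:y <http://example.org/knows> _:x .\n'

same_bytes = hashlib.sha256(doc_a.encode()).digest() == hashlib.sha256(doc_b.encode()).digest()
same_graph = isomorphic(parse(doc_a), parse(doc_b))
print(same_bytes, same_graph)  # False True
```

The brute-force search is factorial in the number of blank nodes, which is exactly the cost a real canonicalization algorithm tries to avoid in the common case.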
These are not real-world use-cases. Why would one want to sign independently of serialization or format? The real-world need is that people start signing graphs at all. But why would they sign some abstract format that is independent of the serialization format? That supposedly independent format is itself a format and will have competition soon. It's the way of the world: fork, fork, fork.
I sign my RDF graphs as bytearrays with PGP and avoid all the hassle.
I assume that serialization formats might reference this standard so that they don’t need to reinvent the wheel that is graph normalization.
> A canonicalization algorithm is necessary, but not necessarily sufficient, to handle many of these use cases.
It’s kind of like how there’s a standard for structured copy of JS objects that gets used for things like the web worker spec.
Signing something independent of serialization might be useful since then the exact serialization format can vary. For example, maybe the data is already serialized using SQLite. I’d prefer to avoid loading the data into memory and reserializing it just to check the signature. Instead, it’d be nice to just canonicalize it and then utilize the indexing capabilities of SQLite to minimize memory usage.
So the use-case is a very tentative optimization. This tentative optimization is achieved by introducing a very complicated algorithm that is not guaranteed to finish in reasonable time on pathological inputs.
You could also check signatures when loading the data and keep the original bytearray separately in slow/cheap storage.
That way you can sign RDF graphs like you sign any bytearray and keep a simple design.
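A rough sketch of that "verify on load, archive the original bytes" design, for illustration only. It assumes HMAC-SHA256 from the Python standard library as a stand-in for a PGP signature (a real setup would use an asymmetric signature), and `SECRET_KEY`, `sign`, `verify`, and `load_verified` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical demo key; HMAC is only a stand-in for PGP here.
SECRET_KEY = b"demo-key-not-for-production"


def sign(data):
    """Sign the exact bytes of a serialized graph."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()


def verify(data, signature):
    """Check a signature against the exact bytes, in constant time."""
    return hmac.compare_digest(sign(data), signature)


def load_verified(data, signature):
    """Verify the bytes first, then parse them into (s, p, o) tuples."""
    if not verify(data, signature):
        raise ValueError("signature mismatch")
    triples = set()
    for line in data.decode("utf-8").strip().splitlines():
        s, p, o = line.rstrip(" .").split()
        triples.add((s, p, o))
    return triples


original = b'_:a <http://example.org/name> "Alice" .\n'
signature = sign(original)

# Verification happens once, at load time; `original` then goes to cheap storage.
graph = load_verified(original, signature)

# The trade-off: any re-serialization (here, a renamed blank node) no longer verifies,
# so the archived bytes are the only thing the signature can be checked against.
reserialized = b'_:b0 <http://example.org/name> "Alice" .\n'
still_valid = verify(reserialized, signature)
```

The design stays simple, at the price of having to keep the signed bytes around verbatim.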
I used RDF canonicalization in a computation-graph system where the inputs and outputs of each computation were one or more RDF graphs.
Many of the computations did things like inference that created new blank nodes, and did so in a non-deterministic order; at the same time, many computations produced structurally identical outputs (with a small number of triples). By using RDF canonicalization as the basis for content-addressing those small graphs, it became quite easy to avoid redoing a lot of the computations that would otherwise have been repeated because of the non-deterministic order. For larger graphs we just used a hash of the native serialization, as redoing the computation was cheaper than trying to canonicalize.
Adding that canonicalization-based system gave the whole system a significant performance boost, so yeah, there are some scenarios where you "would want to cope with that".
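For anyone curious what that kind of content addressing can look like, here is a small sketch under stated assumptions. It is not the system described above and not the W3C algorithm: the canonical form is found by brute force (try every blank-node relabeling, keep the lexicographically smallest result), which is only viable for tiny graphs. `SMALL_GRAPH_LIMIT`, `content_key`, `run_cached`, and `expensive_inference` are hypothetical names.

```python
import hashlib
from itertools import permutations

SMALL_GRAPH_LIMIT = 6  # hypothetical cutoff on the number of blank nodes


def blank_nodes(triples):
    """Return the sorted blank-node labels used in subject or object position."""
    return sorted({t for s, _, o in triples for t in (s, o) if t.startswith("_:")})


def canonical_lines(triples):
    """Brute-force canonical form: the smallest sorted serialization over all relabelings."""
    bnodes = blank_nodes(triples)
    best = None
    for perm in permutations(range(len(bnodes))):
        mapping = {b: f"_:c{i}" for b, i in zip(bnodes, perm)}
        lines = sorted(f"{mapping.get(s, s)} {p} {mapping.get(o, o)} ." for s, p, o in triples)
        if best is None or lines < best:
            best = lines
    return best


def content_key(triples, native_bytes):
    """Canonical hash for small graphs, plain hash of the native bytes for large ones."""
    if len(blank_nodes(triples)) <= SMALL_GRAPH_LIMIT:
        payload = "\n".join(canonical_lines(triples)).encode("utf-8")
        return "canon:" + hashlib.sha256(payload).hexdigest()
    return "raw:" + hashlib.sha256(native_bytes).hexdigest()


cache = {}
stats = {"calls": 0}


def expensive_inference(triples):
    """Placeholder for a costly downstream computation."""
    stats["calls"] += 1
    return len(triples)


def run_cached(triples, native_bytes):
    """Run the computation only if no structurally identical input was seen before."""
    key = content_key(triples, native_bytes)
    if key not in cache:
        cache[key] = expensive_inference(triples)
    return cache[key]


# Two runs of an upstream step that differ only in blank-node labels.
run1 = {("_:n1", "<http://example.org/p>", "_:n2"),
        ("_:n2", "<http://example.org/q>", "<http://example.org/x>")}
run2 = {("_:z9", "<http://example.org/p>", "_:z3"),
        ("_:z3", "<http://example.org/q>", "<http://example.org/x>")}

run_cached(run1, b"serialization of run 1")
run_cached(run2, b"serialization of run 2")
print(stats["calls"])  # 1: the second run hits the cache
```

The cutoff mirrors the trade-off in the comment: past some size, rehashing the native bytes and occasionally recomputing is cheaper than canonicalizing.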
https://www.w3.org/TR/2023/CR-rdf-canon-20231031/#how-to-rea...