by tkuhn 4265 days ago
(disclaimer: I am an author of the paper)

Thanks for your comments. First off: yes, most (perhaps all) of the applied methods are not novel; some of them have been around for a long time. We claim novelty only in how these existing methods are combined to solve the problem of data availability and integrity on the web.

Yes, the magnet URI scheme is highly related, and we probably should have referred to it in one way or another. However, there are crucial features that magnet links do not provide (as far as I know): you cannot generate a hash that represents content on a more abstract level than byte sequences (MIME types by themselves don't solve that problem), and you also cannot have self-references. All of the features from our list of requirements are supported by some approaches, but (to our knowledge) no approach supports all of them at the same time.
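To make "a hash on a more abstract level than byte sequences" concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the function name `abstract_hash` is made up for illustration, and it assumes a simplified line-based format with one statement per line. The point is only that two files with different bytes but the same set of statements get the same hash.

```python
import hashlib


def abstract_hash(statements_text: str) -> str:
    """Hash a set of statements independently of their order in the file.

    Assumption (illustration only): one statement per line; blank lines,
    surrounding whitespace, and duplicate statements are ignored.
    """
    statements = {line.strip() for line in statements_text.splitlines() if line.strip()}
    digest = hashlib.sha256()
    # Sorting gives every serialization of the same statement set one canonical order.
    for statement in sorted(statements):
        digest.update(statement.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Two serializations of the same two statements, in different order.
doc_a = (
    '<http://example.org/a> <http://example.org/p> "1" .\n'
    '<http://example.org/b> <http://example.org/p> "2" .\n'
)
doc_b = (
    '<http://example.org/b> <http://example.org/p> "2" .\n'
    '<http://example.org/a> <http://example.org/p> "1" .\n'
)

same_content = abstract_hash(doc_a) == abstract_hash(doc_b)
same_bytes = (
    hashlib.sha256(doc_a.encode("utf-8")).hexdigest()
    == hashlib.sha256(doc_b.encode("utf-8")).hexdigest()
)
```

A plain byte-level hash, which is what a magnet link carries, treats `doc_a` and `doc_b` as different files; the sorted variant treats them as the same content.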

In terms of search engines caching research data, I agree! We shouldn't trust existing providers too much but should instead build a dedicated decentralized infrastructure for scientific purposes (this is what I am working on now).

I am sure the performance measures can be improved (incremental cryptography might allow us to get rid of sorting altogether). The shape of the curve, however, is not much affected by whether the statements are already sorted (they are not sorted for TransformRdf and TransformLargeRdf!).

I hope this clarifies some things.

1 comments

Thanks for your response; it does clarify things.

But I don't think I understand your concern about abstract hashing and how it would need to be something fundamentally new. Both the order normalization and self-reference are simply preprocessing stages on your data, albeit in slightly different forms. The sortedness requirement, I think, is captured by MIME type parameters (the "charset=" in "text/html;charset=UTF-8"), as it does not change the fact that the document is an RDF graph. For the placeholder trick, I think you're right and that you'd want something like a "text/rdf+selfref" MIME type to indicate that it is not in fact valid RDF until preprocessing has been performed. All told, your RDF module would be described in MIME as something like "text/rdf+selfref;sorted=".
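For readers unfamiliar with the placeholder trick, a rough Python sketch of self-reference as a preprocessing stage follows. This is an assumption-laden illustration, not the scheme from the paper: the `{HASH}` marker, the function names `finalize` and `verify`, and the example URIs are all hypothetical.

```python
import hashlib
from typing import Tuple

PLACEHOLDER = "{HASH}"  # hypothetical marker, not taken from the paper


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def finalize(template: str) -> Tuple[str, str]:
    """Hash the document while the placeholder is in place, then fill in the hash."""
    content_hash = _digest(template)
    return template.replace(PLACEHOLDER, content_hash), content_hash


def verify(document: str, claimed_hash: str) -> bool:
    """Put the placeholder back and check that the hash still matches."""
    return _digest(document.replace(claimed_hash, PLACEHOLDER)) == claimed_hash


# A document that refers to its own (not yet known) hash-based URI.
template = (
    "<http://example.org/doc.{HASH}#part1> "
    "<http://example.org/partOf> "
    "<http://example.org/doc.{HASH}> .\n"
)
document, content_hash = finalize(template)

ok = verify(document, content_hash)
tampered_ok = verify(document.replace("partOf", "unrelatedTo"), content_hash)
```

The final `document` contains its own hash, yet still verifies, because the hash is always computed over the placeholder form rather than over the final bytes.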

Right, I guess you could define everything into a new MIME type, but I think that would be quite a weird thing to do and wouldn't really be faithful to the idea of MIME types. This MIME type would stand for a type that nobody would use directly for files; it would only stand for some internal intermediate representation (I will not be able to convince people using RDF to switch to my new strange format instead of TriG or N-Quads!). And that means that there would be two MIME types involved for a single file: the actual type (such as application/rdf+xml or application/trig) and then the type for normalization and hash calculation (something like "text/rdf+selfref;sorted="). I think this shows that MIME types are not a straightforward solution to the given problem, and I think this justifies introducing this new level and a new scheme for the trusty URI modules (e.g. "RA").