|
What documents (books, scientific articles) specifically benefit from is a number of highly consistent, highly accurate identifiers: DOI (scientific articles), ISBN (published books), and others (OCLC identifier, Library of Congress Control Number, etc.). With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly. I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. N-gram tuples of 4-5 words are often sufficient to identify a work, particularly if several are selected. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) is also a stumbling point. |
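To make the n-gram idea above concrete, here is a rough sketch of one way such a fingerprint could work. Everything in it is an assumption for illustration rather than an agreed scheme: the function names (`normalize`, `fingerprint`, `overlap`), the defaults of 5-word tuples and 8 tuples per work, the normalization rules, and especially the "keep the k smallest SHA-256 hashes" rule, which is just one deterministic way for two parties to pick the same tuples without coordinating.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Reduce text to lowercase ASCII-ish words to absorb formatting noise."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Treat punctuation, special characters, and any whitespace as separators.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.split()


def fingerprint(text: str, n: int = 5, k: int = 8) -> set[str]:
    """Return up to k hashes of n-word tuples (assumed defaults: n=5, k=8)."""
    words = normalize(text)
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    hashes = sorted(hashlib.sha256(g.encode("utf-8")).hexdigest() for g in grams)
    # Assumption: keeping the k smallest hashes gives everyone the same
    # selection of tuples for the same text, with no shared tuple list.
    return set(hashes[:k])


def overlap(a: set[str], b: set[str]) -> float:
    """Fraction of the smaller fingerprint that also appears in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    clean = "It was the best of times, it was the worst of times."
    messy = "IT WAS   the best of times;\nit was the worst of times"
    print(overlap(fingerprint(clean), fingerprint(messy)))
```

This handles case, punctuation, and whitespace variance, but not OCR errors: a single misread word changes every tuple that contains it, so comparing by partial overlap rather than exact equality is the only tolerance this sketch offers.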