| It's a bit of an itch I've been scratching for a few years. Most especially: given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this programmatically, in a robust and format-independent manner?

For works with well-formed metadata, this isn't an issue. For identical duplicate copies of the same file, a hash is effective. But for the circumstance most often encountered in reality (different forms and formats derived from different sources but containing substantially the same work) there is no simple solution of which I'm aware.

As examples, say you have a reference document, The Reference Document. How do I determine that:

- An ASCII-only text file
- Markdown, HTML, DocBook, and LaTeX sources
- PDF, MS Word (which version?), PS, DjVu, ePub, or .mobi files (sling in any other formats you care to mention)
- Hardbound and paperback physical copies
- Scans made from the same or different physical books or instances, versions, and/or translations
- Audiobooks based on a work, by the same or different readers
- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)
- Re-typed or OCRed text

... all refer to the same work?

How do you define "work"? How do you define "differences between works"? How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, and additions for the first; transcription errors for the second; and scanning or rendering artefacts for the third.)

If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself.
It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form. In bibliographic / library science, the relevant terms are "work, expression, manifestation": https://www.loc.gov/marc/marbi/2011/2011-dp03.html |
And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.
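To make that concrete, here is a rough sketch of what such a curation could look like as a data structure. Everything in it (the `Curation` class, the curator names, the work labels, the sample snippets) is hypothetical and only meant to illustrate the shape of the idea: each curator keeps their own hash-to-work mapping, and a consumer picks whose opinion to trust.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


class Curation:
    """One curator's opinion about which file hashes belong to which work."""

    def __init__(self, curator: str):
        self.curator = curator
        self._work_of = {}  # digest -> work label chosen by this curator

    def assign(self, data: bytes, work: str) -> None:
        self._work_of[file_digest(data)] = work

    def work_of(self, data: bytes):
        # None means this curator has expressed no opinion on the file.
        return self._work_of.get(file_digest(data))


def opinions(curations, data: bytes) -> dict:
    """Collect every curator's answer for one file; they may disagree."""
    return {c.curator: c.work_of(data) for c in curations}


# Hypothetical sample: the same passage in two formats hashes differently.
plain = b"To be, or not to be, that is the question."
html = b"<p>To be, or not to be, that is the question.</p>"

scholar = Curation("shakespeare_scholar")
scholar.assign(plain, "Hamlet")
scholar.assign(html, "Hamlet")

archivist = Curation("format_archivist")
archivist.assign(plain, "Hamlet (plain-text edition)")

print(opinions([scholar, archivist], html))
```

The two byte strings have different digests, so the hash alone says nothing about their relationship; it is the scholar's mapping that ties both to "Hamlet", while the archivist simply has no opinion on the HTML copy.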
You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).
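As a very crude stand-in for the automated first pass, here is a sketch that is not ML at all: it compares word shingles with Jaccard similarity and hands the uncertain middle band to humans. It assumes the text has already been extracted from whatever format it came in, and the function names and both thresholds are arbitrary choices of mine for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def triage(text_a: str, text_b: str, accept: float = 0.8, reject: float = 0.2):
    """Auto-decide the clear cases; send the rest to human curators."""
    score = jaccard(shingles(text_a), shingles(text_b))
    if score >= accept:
        return "same work", score
    if score <= reject:
        return "different works", score
    return "needs human review", score


original = "To be, or not to be, that is the question."
retyped = "to be or not to be -- that is the question"
ocr_copy = "To be, or not to be, that is the qnestion."
unrelated = "Now is the winter of our discontent"

for candidate in (retyped, ocr_copy, unrelated):
    print(triage(original, candidate))
```

On these short samples the retyped copy is accepted, the unrelated line is rejected, and the single OCR error is enough to push the third into the review queue. On real book-length texts one misread word would barely move the score, so the thresholds would need tuning, and translations or adaptations would defeat this approach entirely, which is where the competing human curations come in.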