by dredmorbius 1837 days ago

What documents (books, scientific articles) specifically benefit from is a set of highly consistent, highly accurate identifiers: DOI (scientific articles), ISBN (published books), and others (OCLC identifier, Library of Congress Catalogue Number, etc.).

With the addition of hashsums (even MD5 and SHA1, though longer and more robust hashsums are preferred), a pretty reliable archive of content can be made. It's a curious case where increased legibility seems to be breaking rather than creating a gatekeeper monopoly.
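A minimal sketch of the hashsum side of this, assuming Python's standard `hashlib`. The function name `file_digests` and its parameters are illustrative, not anything from an existing archive tool. It reads a file in chunks and returns one hex digest per requested algorithm, which could then be stored alongside a DOI or ISBN.

```python
import hashlib


def file_digests(path, algorithms=("md5", "sha1", "sha256"), chunk_size=1 << 20):
    """Return {algorithm name: hex digest} for the file at `path`.

    The file is read in chunks so large scans or PDFs do not need to fit
    in memory. `algorithms` must be names accepted by hashlib.new().
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Note that this only identifies byte-identical files: the same book as PDF and as ePub will produce unrelated digests.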

I've been interested in the notion of more reliable content-based identifiers or fingerprints themselves, though I've found little reliable reference on this. Ngram tuples of 4-5 words are often sufficient to identify a work, particularly if a selection of several is made. Agreeing on which tuples to use, how many, and how to account for potential noise / variations (special characters, whitespace variance, OCR inaccuracy) is a stumbling point, though.
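One way the ngram idea could be sketched, under loudly stated assumptions: the normalization rules, the choice of 5-word tuples, and the "keep the k smallest hashes" selection rule are all illustrative choices, not an agreed standard. The selection rule is a bottom-k sketch, which lets two parties pick the same tuples independently without coordinating. All function names here are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Return a list of lowercase ASCII words, ignoring accents,
    punctuation, and whitespace differences."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()


def shingles(words, n=5):
    """Return the set of all n-word tuples, joined with single spaces."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprint(text, n=5, k=64):
    """Hash every n-word tuple and keep the k smallest hashes.

    Choosing by hash value is deterministic, so anyone applying the same
    rules to the same text selects the same tuples.
    """
    hashes = sorted(
        hashlib.sha256(s.encode("utf-8")).hexdigest()
        for s in shingles(normalize(text), n)
    )
    return set(hashes[:k])


def overlap(fp_a, fp_b):
    """Crude similarity: shared hashes divided by all hashes seen."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

This handles special characters and whitespace variance, but not OCR errors inside a word: a single misread character changes every tuple containing that word, so a real scheme would need many tuples and a tolerance threshold rather than an exact match.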

spicybright 1837 days ago

Why map anything to words for strict identification? Words and language are very error prone vs an id number or hash.

dredmorbius 1837 days ago

It's a bit of an itch I've been scratching for a few years.

Most especially, given two or more instances of what you suspect to be the same or a substantively similar work, how can you assess this in a robust and format-independent manner, programmatically?

For works with well-formed metadata, this isn't an issue.

For identical duplicate copies of the same file, a hash is effective.

But for the circumstance most often encountered in reality --- different forms and formats derived from different sources but containing substantially the same work --- there is no simple solution of which I'm aware. As examples, say you have a reference document The Reference Document.

How do I determine that:

- An ASCII-only textfile

- Markdown, HTML, DocBook, and LaTeX sources

- PDF, MS Word (which version?), PS, DJVU, ePub, or .mobi files (sling in any other formats you care to mention)

- Hardbound and paperback physical copies

- Scans made from the same or different physical books or instances, versions, and/or translations.

- Audiobooks based on a work. By the same or different readers.

- Dramatic performances, films, video series, comic-book adaptations, etc., of a work. (Say: Hamlet or Romeo and Juliet. What is the relationship to "West Side Story" (and which version), or Pyramus and Thisbe?)

- Re-typed or OCRed text

... all refer to the same work?

How do you define "work"?

How do you define "differences between works"?

How do you distinguish intentional, accidental, and incidental differences between instances? (Say: translations, errata, corrections, additions for the one, transcription errors for the second, and scanning or rendering artefacts for the third.)

If you're working in an environment in which instances of works come from different sources with different provenances, these questions arise. At least some of these questions are prominent in library science itself. It's the technical mapping of digitised formats I'm focusing on most closely, so the physical instantiations aren't as critical here, though the presumption is that these could be converted to some machine-readable form.

In bibliographic / library science, the relevant terms are "work, expression, manifestation":

https://www.loc.gov/marc/marbi/2011/2011-dp03.html

jl6 1836 days ago

The general problem here is not solvable with technology if there is no universally agreed definition for “a work” - and there isn’t (this touches on some profound issues of ontology).

And so I suspect the way forward is to maintain human-curated mappings of file hashes to “works”, where “a work” is a matter of the curator’s opinion, and different curations will be valued differently by different consumers of that information. For example, maybe a literary expert would have a great and respected method of organizing the works and derived works of Shakespeare, but that same person might not be sought out for their views on pop songs.
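A rough sketch of what such competing curations might look like as a data structure. Everything here (the `Curation` class, the `opinions` helper, the work identifiers) is hypothetical and only illustrates the idea that "same work" is answered per curator, and that a curator may have no opinion at all.

```python
from dataclasses import dataclass, field


@dataclass
class Curation:
    """One curator's mapping of file hashes to their own work identifiers."""
    curator: str
    hash_to_work: dict = field(default_factory=dict)

    def assign(self, file_hash, work_id):
        self.hash_to_work[file_hash] = work_id

    def same_work(self, hash_a, hash_b):
        """True or False if both hashes are curated, None if either is unknown."""
        work_a = self.hash_to_work.get(hash_a)
        work_b = self.hash_to_work.get(hash_b)
        if work_a is None or work_b is None:
            return None
        return work_a == work_b


def opinions(curations, hash_a, hash_b):
    """Collect every curator's verdict; the consumer decides whom to trust."""
    return {c.curator: c.same_work(hash_a, hash_b) for c in curations}
```

For example, one curator might file two early printings of Hamlet under a single work, another might treat them as distinct works, and a pop-music curator would simply return `None` for both.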

You could probably start with an ML-based curation that gets it 80% right, and fill out the last 20% with gamified crowdsourcing (with multiple competing interpretations of the last 20%).

dredmorbius 1836 days ago

Yes, it's complicated.

All analogies melt if they're pushed loudly enough. And all models are wrong, though some are useful.

The notion of a work has utility: it respects the notion of different forms, variations, and evolution over time. If you're looking at, say, multiple editions of a book, or even something much more dynamic, say, source code or a Wiki entry, then yes, the specific content may change at any point and pass through many versions, but those versions are connected through edit events. A good revision control system will capture much of that, if the history interests you.

Ultimately, I'd argue that "work" is defined in relationships and behaviours. A record intermediates between author(s) and reader(s) (or listeners, viewers, etc.), concerning some informational phenomenon, perhaps fictional, perhaps factual, perhaps itself an action (as in a marriage record, divorce decree, or court decision). The work in its total context matters. (At which point we discover most works have very little context...).

The file-hashes-to-work mapping is all but certain to play a large role, but even that is only a means of indicating a relationship that is established by some other means.

The notion of selecting an arbitrary set of ngram tuples to establish a highly probable relationship is likely to remain at least one of those means.

And yes, the incremental / tuned approach is also likely a useful notion.

Paul Otlet had a lot to say about "documents", though I think "records" is a better term for what he had in mind, as any persistent symbolic artefact: book, painting, music, photograph, film, etc.

mathnmusic 1836 days ago

I have been dealing with the same problem for curating resources at https://learnawesome.org. Projects like Openlibrary do collect unique identifiers for _books_, but for everything else, it mostly takes manual effort. For example, I collect talks/podcasts by the author where they discuss ideas from their books. Then there are summaries written by others.

dredmorbius 1836 days ago

There's a lot of work toward this in library space, though it takes some adaptation to new media formats. Paul Otlet worked in a paper-only medium in the early 20th century but also has some excellent thinking. His books are now seeing translation from French. The Internet Archive and Library of Congress are also doing a lot of relevant work, see the WARC format as an example.

What's particularly relevant now are ephemeral and/or continuously updated online content --- and not just the WWW (http/https), but other protocols (ftp, gemini, ipfs, torrents, ...), as well as apps.

A working truism I developed was that "identity is search that produces a single result". So if you can come up with something that uniquely identifies a work, then that can be a working identifier. I typically focus on what can be reasonably assessed of author, title, publication date, publisher (traditional, website/domain), and failing that, descriptive text. Remember that originally titles were simply the introductory lines of works (a practice that remains in use in some cases, such as the names of church masses or prayers, e.g., "Kyrie Eleison").
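The truism can be sketched in code, with the caveat that the record fields, their order, and the function name `minimal_identifier` are assumptions made for illustration. The idea: keep adding fields to a query until exactly one record in the catalog matches, and treat that query as the working identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    author: str
    title: str
    year: int
    publisher: str


# Hypothetical priority order: which fields to try first.
FIELD_ORDER = ("author", "title", "year", "publisher")


def minimal_identifier(target, catalog):
    """Return the smallest prefix of FIELD_ORDER whose values match only
    `target` within `catalog`, as a dict, or None if no prefix is unique."""
    query = {}
    for name in FIELD_ORDER:
        query[name] = getattr(target, name)
        matches = [
            record for record in catalog
            if all(getattr(record, key) == value for key, value in query.items())
        ]
        if matches == [target]:
            return dict(query)
    return None
```

A work by an author with a single catalog entry is identified by the author alone, while two editions of the same title need the year as well. If even all four fields match more than one record, the function returns `None`, which is the point at which descriptive text would have to take over.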

The Superintendent of Documents (SuDoc) Classification Scheme (used by the US government and GPO) operates by agency, type of publication, and further divisions, as well as date/year. https://www.fdlp.gov/about-fdlp/22-services/929-sudoc-classi...

mandelken 1837 days ago

Probably because, for written text, the words identify the content, while the hash relates more to the digital carrier format (pdf vs epub), and an id number can change between publications, countries, etc.

dredmorbius 1837 days ago

Bingo.

And to drag in metadata, it may:

- Not be present.

- Be inaccurately applied to the correct work (metadata say the work is different, work is in fact related/same).

- Be inaccurately applied to the wrong work (metadata say the works are the same/related, they are not).

fnord77 1837 days ago

text to speech the doc then an acoustic fingerprint on the audio :)

dredmorbius 1836 days ago

You'd all but certainly be better going in the other direction.

Text is a more constrained state space than speech/audio.
