Hacker News
by derefr 2345 days ago
One nice thing about digital data, as opposed to physical artefacts, is that you don’t need to keep digital data’s metadata attached to the data “at the hip.”

Through the magic of cryptographic hash algorithms, you can just keep your data sets floating around “raw” (like in these torrents), and then, elsewhere, ascribe metadata to the hash of the content it is meant to annotate.

Then, later, you can reassemble them in either order—either by first finding a data set, hashing it, and then looking up metadata in some metadata-hosting service; or by first browsing a catalogue of indexed metadata, finding out about a dataset that meets your needs, and then retrieving the data set by its hash.

Which is to say: with digital data, library science (creating metadata and chains of custody, and indexing them for search) and archiving (ensuring access to pristine artifacts over time) don’t need to happen at the same time, in the same place. There can be separate “artifact hosting” and “metadata library” services. (This is especially helpful where private IP is involved: your metadata library can still hold the metadata for a data set you don’t have the rights to, and those with the rights can go get the data set themselves.)
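
To make the two lookup directions concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the two dictionaries stand in for hypothetical "artifact hosting" and "metadata library" services, SHA-256 is just one reasonable choice of hash, and the function names are made up rather than taken from any real system.

```python
import hashlib
from typing import Dict, List, Optional

# Hypothetical in-memory stand-ins for two independent services.
artifact_store: Dict[str, bytes] = {}    # content hash -> raw bytes
metadata_library: Dict[str, dict] = {}   # content hash -> metadata record


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the content's identifier."""
    return hashlib.sha256(data).hexdigest()


def publish_artifact(data: bytes) -> str:
    """Archive raw data with no metadata attached; return its hash."""
    digest = content_hash(data)
    artifact_store[digest] = data
    return digest


def annotate(digest: str, metadata: dict) -> None:
    """Attach metadata to a hash; the data itself need not be present."""
    metadata_library[digest] = metadata


def metadata_for(data: bytes) -> Optional[dict]:
    """Direction 1: start from the data, hash it, look up its metadata."""
    return metadata_library.get(content_hash(data))


def find_datasets(keyword: str) -> List[str]:
    """Direction 2: browse metadata first, get back matching hashes."""
    needle = keyword.lower()
    return [
        digest
        for digest, record in metadata_library.items()
        if needle in record.get("title", "").lower()
    ]


def fetch_verified(digest: str) -> bytes:
    """Retrieve data by hash and check that it still matches the hash."""
    data = artifact_store[digest]
    if content_hash(data) != digest:
        raise ValueError("stored artifact does not match its hash")
    return data
```

Note that `annotate` never touches `artifact_store`, which is the point about private data: a catalogue can describe a hash whose content it does not host.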

3 comments

  > is that you don’t need to keep digital data’s metadata attached to the data “at the hip.”

You don't have to, but it's still mostly a good idea. But this stuff isn't either-or. We can have both.

This is especially true for research-oriented files, where consumers are often unable or unwilling to maintain a functional metadata store and do a lot of manual file handling. Saying "well, somebody could have set up a super-awesome metadata system that tracks this" doesn't magically make those resources exist.

This flexibility in time, specialization, and order of operation is surely one of the joys of modern digital collections.

Library scientists might say archiving and structuring and curation are all facets of that science. And you'll also want a hash search engine that finds related hashes, as there can be many revisions + versions, only some of which have some metadata.
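
A rough sketch of what "finding related hashes" could look like, in Python. This assumes someone has already recorded which hashes are revisions of which; the `related` graph, the `metadata` table, and both function names are hypothetical and do not describe any existing service.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set

# Hypothetical data: undirected "is a revision of" links between hashes,
# plus metadata that exists for only some of those hashes.
related: Dict[str, Set[str]] = defaultdict(set)
metadata: Dict[str, dict] = {}


def link_revisions(hash_a: str, hash_b: str) -> None:
    """Record that two hashes are revisions or versions of one another."""
    related[hash_a].add(hash_b)
    related[hash_b].add(hash_a)


def annotated_relatives(start: str) -> List[str]:
    """Breadth-first search from `start`; return the reachable hashes that
    have metadata, nearest first (ties broken alphabetically)."""
    seen = {start}
    queue = deque([start])
    found: List[str] = []
    while queue:
        current = queue.popleft()
        for neighbour in sorted(related.get(current, ())):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if neighbour in metadata:
                found.append(neighbour)
            queue.append(neighbour)
    return found
```

So if you hold an unannotated revision, you can still walk to the nearest version that somebody did annotate.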

Aaaand someone has to do the work of computing the index and annotating the hashes.

I think it's worth recognizing that this is a good first step in a hard problem. Hosting many TB of data for free isn't easy. Building an index on top of that data isn't easy either, and it looks like no such index exists today, but if someone decided to build that index they wouldn't need to worry about the hosting portion of the problem. That's a great starting point.