by glofish 2345 days ago
Cool idea, and it is impressive that it is still around. Alas, it is flawed the same way all scientific data is flawed.

There is no metadata. All you have is an awkward, imprecise textual search of the abstract that comes with the data. Good luck hosting the world's data that way.


One nice thing about digital data, as opposed to physical artefacts, is that you don’t need to keep digital data’s metadata attached to the data “at the hip.”

Through the magic of cryptographic hash algorithms, you can just keep your data sets floating around “raw” (like in these torrents), and then, elsewhere, ascribe metadata to the hash of the content it is meant to annotate.

Then, later, you can reassemble them in either order—either by first finding a data set, hashing it, and then looking up metadata in some metadata-hosting service; or by first browsing a catalogue of indexed metadata, finding out about a dataset that meets your needs, and then retrieving the data set by its hash.

Which is to say: with digital data, library science (creating metadata and chains of custody, and indexing them for search) and archiving (ensuring access to pristine artifacts over time) don’t need to happen at the same time, in the same place. There can be separate “artifact hosting” and “metadata library” services. (This is especially helpful in contexts where private IP is involved: you can still keep, in your metadata library, the metadata for a data set you don’t have the rights to, and those with the rights can go get the data set themselves.)
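
To make the two lookup directions concrete, here is a minimal sketch. It assumes SHA-256 as the content hash and uses an in-memory dict as a stand-in for a metadata-hosting service; `content_hash`, `MetadataLibrary`, and the sample fields are hypothetical names for illustration, not any existing service's API.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest that identifies a data set by content."""
    return hashlib.sha256(data).hexdigest()


class MetadataLibrary:
    """Toy stand-in for a metadata service; it never stores the data itself."""

    def __init__(self):
        self._records = {}  # content hash -> metadata dict

    def annotate(self, digest, **fields):
        # Metadata is attached to the hash, so it can be written without
        # holding (or having the rights to) the data set.
        self._records.setdefault(digest, {}).update(fields)

    def lookup(self, digest):
        # Direction 1: you already have the data; hash it, fetch metadata.
        return self._records.get(digest, {})

    def search(self, **wanted):
        # Direction 2: browse metadata first, get back hashes to retrieve.
        return [digest for digest, meta in self._records.items()
                if all(meta.get(k) == v for k, v in wanted.items())]


library = MetadataLibrary()
raw = b"example data set bytes"
digest = content_hash(raw)
library.annotate(digest, title="Example scans", modality="MRI")

found_meta = library.lookup(content_hash(raw))
found_hashes = library.search(modality="MRI")
```

The archive and the library only have to agree on the hash function. Everything else (who hosts the bytes, who curates the fields) can vary independently.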

  > [...] is that you don’t need to keep digital data’s metadata attached to the data “at the hip.”

You don't have to, but it's still mostly a good idea. But this stuff isn't either-or. We can have both.

This is especially true for research-oriented files, where consumers are often unable or unwilling to maintain a functional metadata store, and do a lot of manual file handling. Saying "well, somebody could have set up a super-awesome metadata system that tracks this" doesn't magically make those resources exist.

This flexibility in time, specialization, and order of operation is surely one of the joys of modern digital collections.

Library scientists might say archiving and structuring and curation are all facets of that science. And you'll also want a hash search engine that finds related hashes, as there can be many revisions + versions, only some of which have some metadata.

Aaaand someone has to do the work of computing the index and annotating the hashes.

I think it's worth recognizing that this is a good first step in a hard problem. Hosting many TB of data for free isn't easy. Building an index on top of that data isn't easy either, and it looks like no such index exists today, but if someone decided to build that index they wouldn't need to worry about the hosting portion of the problem. That's a great starting point.

There is metadata. It is stored in BibTeX along with every torrent. This format allows it to be a freeform database where the user can add fields as they want. We (Academic Torrents) can then build new ways to display this metadata. Also, the "abstract" part of the metadata is rendered as Markdown on the details page of a torrent. Here is a good example: https://academictorrents.com/details/d52ccc21455c7a82fd6e589...

Ok, I see that there is code provided there. Better than nothing, but geez, it is not really what metadata should be like:

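  def get_labels(rightside):
    # rightside: array of per-voxel labels, presumably a NumPy array with
    # 0 as background, since the code relies on elementwise comparisons.
    met = {}
    # Ratio of non-zero voxels to zero voxels.
    met['brain'] = (
        1. * (rightside != 0).sum() / (rightside == 0).sum())
    # Fraction of non-zero voxels with a label above 2; the 1e-10 avoids
    # division by zero when there are no non-zero voxels.
    met['tumor'] = (
        1. * (rightside > 2).sum() / ((rightside != 0).sum() + 1e-10))
    # Boolean flags derived from fixed thresholds.
    met['has_enough_brain'] = met['brain'] > 0.30
    met['has_tumor'] = met['tumor'] > 0.01
    return met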
I will say that it is very handy to know exactly how the labels were computed.

What I really meant is a way to search and select data based on metadata. For example has_tumor.
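
As a rough illustration of what such a selection could look like, here is a sketch that filters a small metadata index by field. The file names and values are made up, and `select` is a hypothetical helper; only the field names come from the `get_labels` snippet above.

```python
# Hypothetical per-file records in the shape get_labels returns,
# keyed by invented file names inside a torrent.
index = {
    "scan_001.npy": {"brain": 0.45, "tumor": 0.02,
                     "has_enough_brain": True, "has_tumor": True},
    "scan_002.npy": {"brain": 0.50, "tumor": 0.00,
                     "has_enough_brain": True, "has_tumor": False},
    "scan_003.npy": {"brain": 0.10, "tumor": 0.05,
                     "has_enough_brain": False, "has_tumor": True},
}


def select(index, **wanted):
    """Return the sorted names of files whose metadata matches every field."""
    return sorted(name for name, meta in index.items()
                  if all(meta.get(k) == v for k, v in wanted.items()))


with_tumor = select(index, has_tumor=True)
usable_with_tumor = select(index, has_tumor=True, has_enough_brain=True)
```

The point is that the query runs against the small metadata records, so the matching files can be identified before any of the bulk data is fetched.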

Also note how everything is still one single blob, to get one line of any of the files, one would need to download everything.

BitTorrent does support partial downloads that request only some files or byte ranges out of a torrent. Some of the torrents are just compressed ZIP archives, but for the others you could look at the code / documentation to see which files are relevant before downloading 10GB of data.
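
A small sketch of why per-file downloads work, assuming the original (v1) BitTorrent layout in which a multi-file torrent's files are laid end to end and the combined stream is cut into fixed-size pieces. `pieces_for_file` and the sample sizes are hypothetical and are not taken from any client's API.

```python
def pieces_for_file(file_lengths, file_index, piece_length):
    """Return the range of piece indices that overlap one file.

    file_lengths lists the files' sizes in bytes, in torrent order.
    """
    start = sum(file_lengths[:file_index])  # byte offset of the file
    length = file_lengths[file_index]
    if length == 0:
        return range(0)  # an empty file needs no pieces
    end = start + length  # exclusive end offset
    first = start // piece_length
    last = (end - 1) // piece_length
    return range(first, last + 1)


# Made-up torrent: three files and a tiny piece size to keep numbers small.
file_lengths = [1000, 250, 4000]
piece_length = 512

needed = list(pieces_for_file(file_lengths, 1, piece_length))
```

Here the 250-byte middle file needs only pieces 1 and 2 out of 11. Pieces that straddle a file boundary also carry a few bytes of the neighbouring files, so a client fetching one file downloads slightly more than that file's size.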

I think the abstract is sufficient for searching data; expecting some kind of smart database that can handle all the weird formats science uses is a bit much.

There are even torrent clients that export a FUSE VFS so you can use your standard tools.
  > one would need to download everything

Just download it then. We got MP3 albums off Napster on modems back in the day; surely getting that torrent is easier and faster today.
