HN Mirror


by rspeer 3187 days ago

The use case really speaks to me, but I'm not convinced that decentralization will keep datasets from getting lost.

I spent a while trying to download recent updates to the Reddit comment corpus [1], which is hosted on BitTorrent. The downloads never seem to finish.

It seems to me that decentralization means that, when a dataset stops being new and exciting, it will disappear. How will Dat counter this?

[1] https://www.reddit.com/r/datasets/comments/65o7py/updated_re...

3 comments

yoshuaw 3187 days ago

Because Dat is just a protocol, decentralization is a choice. For quick, ephemeral exchanges, direct P2P works brilliantly. For longer-lived datasets, sharing them with a (commercial) mirror might make sense. Or perhaps you host them yourself. The beauty is that you, as a user of the protocol, get to decide what works best for you.


filiwickers 3187 days ago

We have a few approaches to the disappearing data.

First, we are working with libraries, universities, and other groups with large amounts of storage/bandwidth. They'd help provide hosting for datasets used inside their institutions or for other essential datasets.

Second, we started to work on at-home data hosting with Project Svalbard[1]. This is kind of a SETI@home idea where people could donate server space at home to help back up "unhealthy" data (data that doesn't have many peers).

Finally, for "published" data (such as data on Zenodo or Dataverse), we can use those sites as a permanent HTTP peer. So if the data isn't available from any p2p peer, you can get it directly from the published source.
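To make the last two ideas concrete, here is a rough sketch in Python. It is not Dat's API: `Dataset`, `MIN_HEALTHY_PEERS`, `is_unhealthy`, and `pick_source` are all hypothetical names, and the threshold of 3 peers is an arbitrary assumption. It only illustrates flagging data with few peers as "unhealthy" and falling back to a published HTTP source when no p2p peer is available.

```python
from dataclasses import dataclass
from typing import Optional

# Arbitrary threshold chosen for illustration; not a value from Dat.
MIN_HEALTHY_PEERS = 3


@dataclass
class Dataset:
    """Hypothetical record describing where a dataset can be fetched from."""
    name: str
    p2p_peers: int                     # peers currently sharing it over p2p
    http_mirror: Optional[str] = None  # permanent published source, if any


def is_unhealthy(ds: Dataset) -> bool:
    """Data with few peers is a candidate for donated backup space."""
    return ds.p2p_peers < MIN_HEALTHY_PEERS


def pick_source(ds: Dataset) -> str:
    """Prefer p2p; fall back to the published HTTP source if no peer is online."""
    if ds.p2p_peers > 0:
        return "p2p"
    if ds.http_mirror is not None:
        return ds.http_mirror
    raise LookupError(f"no source available for {ds.name}")
```

With this sketch, a dataset that has zero peers but a published copy still resolves to its HTTP URL, while a dataset with neither raises an error, which is the "disappearing data" case.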

As others said, decentralization is an approach but not a solution. It gives you the flexibility to centralize or distribute data as necessary without being tied to a specific service. But we still need to solve the problem!

[1] https://medium.com/@maxogden/project-svalbard-a-metadata-vau...


tbv 3187 days ago

That’s something we think about a lot. Decentralization isn’t a silver-bullet solution to data loss, but I do think it’s more resilient than what we typically do now.

To counter that, you can take measures to mirror important datasets with a dedicated peer. It requires effort, but it at least makes it much, much harder for, say, a government agency to take down public data without warning.
