
by kmeisthax 1955 days ago

The problem is that opt-in archival is effectively no archival.

If you look at how data is lost online, you'll find that by far the most obvious and immediate cause of data loss is third-party intermediaries shutting down old sites. For example, Yahoo! decided to burn down Geocities - and shittons of early Internet history - just to save some hosting and storage costs. A second cause is neglect; say you move your blog to a new platform but don't bother redirecting the old URLs, so now you've just carved a hundred or so new dead links into other people's content. In both of those cases, data wasn't being deliberately removed because it needed to go - the loss just happened by accident.

Furthermore, we don't usually know when these accidents happen. Yes, occasionally some intermediary announces a shutdown in advance, but oftentimes linked sites just die. This phenomenon is known as link rot, and it can happen for a whole host of accidental reasons. Try going to a 10-year-old news article and clicking on any of the links, or going to a decades-old forum thread and looking at any of the images. Count how many of them still work. The number is depressingly low.

Now, try to imagine getting consent to archive in advance from people who do not know or care about the problems I've just mentioned. You won't get very far - not because the archiving is harmful to them, but because most people do not understand the problem. People don't back up their own shit until after they've already lost heaps of data. Even in situations like the Geocities shutdown, the logistics of actually asking for permission to archive, on top of running a bunch of scrapers to actually do the archival, are... complicated. You could do it, but the vast majority of data would go unarchived purely because nobody could locate its owner.