Hacker News
by a_m_kelly 5712 days ago
I'd love to know what sort of infrastructure you're running this on. I'm taking a course for my Master's in library school that deals with similar problems: maintaining and preserving digital materials for the "long term".

From your site it sounds like you wrote a script using wget to harvest the files and another to check them against versions that were still up. What do you do on the server end now to ensure that the files are still working correctly? Are you running periodic checksums on them or the like? Finally, are you looking for any help from an interested novice?

1 comments

I have a large database table that stores the md5 hashes of all the files, and there is a script that can compare the entire contents of the site against those stored hashes (and against a second copy, if it comes to that).
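
For anyone curious, a check along these lines might look roughly like the sketch below. This is not the actual reocities script, just a minimal illustration of the idea: it assumes a SQLite table called file_hashes with path and md5 columns, and the function names md5_of_file, record_hashes and verify_hashes are all made up for the example.

```python
import hashlib
import sqlite3
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_hashes(conn, root):
    """Store (or refresh) the MD5 of every file under root.

    The file_hashes table and its columns are hypothetical names.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_hashes "
        "(path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    root = Path(root)
    for p in sorted(root.rglob("*")):
        if p.is_file():
            conn.execute(
                "INSERT OR REPLACE INTO file_hashes (path, md5) VALUES (?, ?)",
                (p.relative_to(root).as_posix(), md5_of_file(p)),
            )
    conn.commit()


def verify_hashes(conn, root):
    """Compare files on disk with the stored hashes.

    Returns two lists of relative paths: (missing, corrupted).
    """
    root = Path(root)
    missing, corrupted = [], []
    rows = conn.execute("SELECT path, md5 FROM file_hashes ORDER BY path")
    for rel_path, expected in rows:
        full = root / rel_path
        if not full.is_file():
            missing.append(rel_path)
        elif md5_of_file(full) != expected:
            corrupted.append(rel_path)
    return missing, corrupted
```

You would run record_hashes once after harvesting and verify_hashes periodically (from cron, say). Anything that lands in the corrupted or missing list is a candidate for restoring from the second copy, which this sketch does not attempt.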

Some bitrot is inevitable but I think it's under control for now.

As for help, yes, but right now I'm pretty swamped with other stuff; the next round of work on reocities will likely come after the new year.

Have you considered using something like MogileFS? It'd be perfect for this sort of situation.

Let me know if you're interested in this or have any questions -- I've dealt a good bit with systems like this in the past, and would love to give you a hand.

You know where to find me :)

And yes, of course I'm interested. But right now no time.