by ersii 4840 days ago
The problem is that Posterous is hard to crawl. One: they'll continuously and automatically ban your IPs, even if you rotate over a lot of them. Two: Posterous can't handle all of the requests.

We (ArchiveTeam) have unfortunately made Posterous unresponsive multiple times. So please be careful not to bring it down completely if you're doing a solo effort.
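
For anyone going solo, here is a rough sketch of what "being careful" could look like: a minimum delay between requests that doubles whenever the server looks unhappy. This is not ArchiveTeam's actual tooling. The class name, the delay values, and the choice of 403/429/5xx as back-off signals are all assumptions made for illustration.

```python
import time


class PoliteThrottle:
    """Hypothetical helper: space out requests and back off on trouble."""

    def __init__(self, base_delay=2.0, max_delay=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay   # assumed starting gap, in seconds
        self.max_delay = max_delay     # assumed upper bound for back-off
        self.delay = base_delay
        self._clock = clock
        self._sleep = sleep
        self._last = None              # time of the previous request

    def wait(self):
        """Block until at least `self.delay` seconds have passed since the last call."""
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

    def record(self, status):
        """Adjust the delay from an HTTP status code (None means timeout)."""
        if status is None or status in (403, 429) or status >= 500:
            # The server is struggling or blocking us: double the gap.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Healthy response: return to the base gap.
            self.delay = self.base_delay
```

Call `wait()` before each request and `record(status)` after it. Resetting straight back to the base delay on the first good response is the simplest option; a gentler ramp-down would be kinder to a site that is already overloaded.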

Please also bear in mind that it's not as simple as just "chuck it into the downloader".

1 comments

Also, please use a sensible format if you're crawling/archiving this.

We're using WARC (Web ARChive), which is an official ISO standard file format (ISO 28500) that the Internet Archive's Wayback Machine can use. It's also a good format for archiving web pages in general.
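
To give an idea of what a WARC file contains, here is a minimal sketch that builds a single WARC/1.0 "resource" record by hand, using only the Python standard library. It is only meant to show the record layout. The function name and the example URL are made up, and a real crawl should use established WARC-writing tools, which also store the full HTTP request and response as separate records.

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type="text/html"):
    """Hypothetical helper: return one WARC/1.0 'resource' record as bytes."""
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                           # version line
        "WARC-Type: resource",                # payload stored without HTTP headers
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",    # length of the payload in bytes
    ]
    # Header lines end with CRLF, a blank line separates headers from the
    # payload, and two CRLFs terminate the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    # Made-up URL and page body, for illustration only.
    record = build_warc_resource_record(
        "http://example.posterous.com/", b"<html>hi</html>"
    )
    with open("example.warc", "wb") as f:
        f.write(record)
```

A WARC file is just a sequence of such records, which is why one file can hold an entire crawl along with the metadata about when and from where each page was fetched.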