by yawnxyz 655 days ago
It's so clever to just pull from Wayback Machine rather than scrape the site itself. Never even thought of that.

Before building an app that depends on the Wayback Machine (or other Archive infrastructure) it's good to keep in mind this post from their blog: <https://blog.archive.org/2023/05/29/let-us-serve-you-but-don...>

One of my favorite tricks when coming across a blog with a long tail of past posts is to verify that it's hosted on WordPress and then to ingest the archives into my feedreader.

Once you have the WordPress feed URL, you can slurp it all in by appending `?paged=n` (or `&paged=n`) for the nth page of the feed. (This is a little tedious in Thunderbird; up till now I've generated a list of URLs and dragged and dropped each one into the subscribe-to-feed dialog. The whole process is amenable to scripting by bookmarklet, though—gesture at a blog with the appropriate metadata, and then get a file that's one big RSS/Atom container with every blog post.)
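
For anyone who'd rather script the URL list than build it by hand, here's a rough Python sketch of the `?paged=n` / `&paged=n` part only. The function name `paged_feed_urls` and the `pages` count are made up for illustration; it doesn't fetch anything, and you'd have to pick or discover how many pages the blog actually has.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def paged_feed_urls(feed_url, pages):
    """Return feed_url with paged=1..pages added to its query string.

    Hypothetical helper: `pages` is a guess supplied by the caller,
    not something this function looks up.
    """
    parts = urlsplit(feed_url)
    # Keep any existing query parameters (e.g. ?feed=rss2), but drop a
    # stale paged= value so it is not duplicated.
    base_query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "paged"
    ]
    urls = []
    for n in range(1, pages + 1):
        query = urlencode(base_query + [("paged", str(n))])
        urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))
        )
    return urls


if __name__ == "__main__":
    # example.com is a placeholder, not a real WordPress blog.
    for url in paged_feed_urls("https://example.com/feed/", 3):
        print(url)
```

Rebuilding the query with `urllib.parse` means the `?` versus `&` choice is handled for you, whether the feed URL already has a query string or not. Each resulting URL can then be subscribed to or fetched one at a time.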

Wait, so if Tumblr is migrating 500M blogs to WordPress[1], does this mean essentially we'll have easy access to all Tumblr blogs' history?

[1] https://arstechnica.com/gadgets/2024/08/tumblr-migrates-more...

I used it to recover some lost content from my blog a few years ago. It was fantastic: https://simonwillison.net/2017/Oct/8/missing-content/