Hacker News
by thisisparker 3256 days ago
ArchiveTeam and URLTeam do great work, but those projects and the OP have different goals. (I'm the OP.) This bot will NOT produce any kind of corpus of tweeted links or links shortened by Twitter's t.co shortener; in fact, it bypasses t.co entirely and I don't have any kind of record of those shortened links.

Instead, it backs up the contents of the linked pages as they were at the time of the tweets. Frankly, my tool doesn't do anything interesting at all with the URLs: it just submits the "expanded URL" that was tweeted to the Wayback Machine and lets it sort out any and all 301s.
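
To make the flow described above concrete, here is a minimal sketch, not the bot's actual code. It assumes the tweet arrives as a dict shaped like a Twitter API v1.1 payload (with `entities.urls[].expanded_url`) and that submission goes through the Wayback Machine's `https://web.archive.org/save/<url>` form. The function names are hypothetical, and the sketch only builds the request URLs rather than sending them.

```python
from urllib.parse import quote

# Save Page Now accepts the target URL appended to this prefix.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def expanded_urls(tweet):
    """Return the expanded URLs from a tweet's entities, ignoring the t.co form."""
    urls = tweet.get("entities", {}).get("urls", [])
    return [u["expanded_url"] for u in urls if u.get("expanded_url")]


def save_request_url(expanded_url):
    """Build the Wayback Machine save URL for one expanded URL."""
    # Leave normal URL punctuation alone; escape only characters such as spaces.
    return WAYBACK_SAVE_PREFIX + quote(expanded_url, safe=":/?&=#%+,;@~")


def save_requests_for(tweet):
    """One save URL per link in the tweet; any 301s are left to the archive."""
    return [save_request_url(u) for u in expanded_urls(tweet)]
```

Fetching each returned URL (for example with `urllib.request.urlopen`) would ask the Wayback Machine to capture the page. The t.co short link is never read or stored.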

2 comments

It might be nice to save as well:

(1) the tweet-detail page, for the tweet that includes the link;

(2) the t.co mapping, so that the tweet-detail page's t.co link can somehow be resolved to the (archived) page to which it links.

I don't think there are any blocks against doing (1).

Unfortunately for (2), Twitter has a blanket robots.txt prohibition in place for the t.co domain. Perhaps IA could be convinced to ignore that robots.txt in the public interest.

Alternatively, perhaps another site could be set up that accepts t.co link-paths, queries t.co in the background, and returns both an HTML page and a working redirect that isn't robots.txt-blocked. LinkArchiver (and any other similar sites) could, as a convention, archive this other site's responses whenever they'd like to archive t.co.
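
As a rough illustration of what such an intermediary might return, the sketch below builds a single response that is both a 301 redirect and a small HTML page recording the mapping. Everything here is hypothetical: the function name, the page layout, and the assumption that the target has already been resolved by querying t.co elsewhere (that lookup is deliberately left out).

```python
from html import escape


def mapping_response(tco_path, target_url):
    """Build (status, headers, body) for one already-resolved t.co path."""
    short = "https://t.co/" + tco_path
    safe_short = escape(short)
    safe_target = escape(target_url, quote=True)
    # The body records the mapping in readable form so an archived copy
    # remains useful even if the redirect itself is not followed.
    body = (
        "<!doctype html>\n"
        "<title>" + safe_short + "</title>\n"
        "<p>" + safe_short + ' resolves to <a href="' + safe_target + '">'
        + safe_target + "</a></p>\n"
    )
    headers = {
        "Location": target_url,  # the working redirect
        "Content-Type": "text/html; charset=utf-8",
    }
    return 301, headers, body
```

An archiver following the convention would request the intermediary's URL for a given t.co path and store whatever comes back, without ever crawling t.co itself.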

Right, but my point was that resolving the 301 and storing the result is more efficient overall than doing a full mirror of a URL only to find out that it's an alternative URL with their tracking parameters appended to the end.

But yeah, if you want to bypass t.co (meaning that it's not useful for archeologists) it makes sense to just archive them.
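
For what it's worth, the "tracking stuff appended to the end" case can be sketched as a normalization step applied after the redirect is resolved. This is only an illustration: treating `utm_`-prefixed query parameters as the tracking ones is an assumption, and `strip_tracking` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters with these prefixes are tracking-only.
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Return the URL with tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two resolved URLs that normalize to the same string could then be treated as one page before deciding whether a full mirror is needed.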