by thisisparker
3256 days ago
ArchiveTeam and URLTeam do great work, but those projects and the OP have different goals. (I'm the OP.) This bot will NOT produce any kind of corpus of tweeted links or of links shortened by Twitter's t.co shortener; in fact, it bypasses t.co entirely, and I don't keep any record of those shortened links. Instead, it backs up the contents of the linked pages as they were at the time of the tweets. Frankly, my tool doesn't do anything interesting with the URLs: it just submits the "expanded URL" that was tweeted to the Wayback Machine and lets it sort out any and all 301s.
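For readers who want to see the shape of that flow, here is a minimal sketch, not the bot's actual code. It assumes the tweet arrives as a dict in the Twitter API v1.1 style (`entities.urls[].expanded_url`) and that the Wayback Machine's Save Page Now endpoint accepts the target URL appended to `https://web.archive.org/save/`. The function names are hypothetical.

```python
import urllib.request

# Assumed Save Page Now prefix; the target URL is appended as-is.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def expanded_urls(tweet):
    """Return the expanded URLs in a tweet dict, ignoring the t.co short forms."""
    urls = tweet.get("entities", {}).get("urls", [])
    return [u["expanded_url"] for u in urls if u.get("expanded_url")]


def wayback_save_url(url):
    """Build the Save Page Now request URL for a target page."""
    return WAYBACK_SAVE_PREFIX + url


def submit(url, timeout=30):
    """Ask the Wayback Machine to capture the page (performs a network request).

    Any redirects on the target are left for the Wayback Machine to follow.
    """
    with urllib.request.urlopen(wayback_save_url(url), timeout=timeout) as resp:
        return resp.status
```

Nothing here stores or looks up the t.co link; only the expanded URL is ever touched.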
(1) the tweet-detail page, for the tweet that includes the link;
(2) the t.co mapping, so that the tweet-detail page's t.co link can somehow be resolved to the (archived) page to which it links.
I don't think there are any blocks against doing (1).
Unfortunately for (2), Twitter has a blanket robots.txt prohibition in place for the t.co domain. Perhaps IA could be convinced to ignore that robots.txt in the public interest.
Alternatively, perhaps another site could be set up that itself accepts t.co link paths, queries t.co in the background, and returns both an HTML page and a working redirect that isn't robots.txt-blocked. LinkArchiver (and any similar sites) could, as a convention, archive this other site's responses whenever they'd like to archive t.co.
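A rough sketch of what such an intermediary might do, under stated assumptions: that t.co answers a HEAD request with a `Location` header, and that a 301 with a small HTML body is an acceptable response shape. The function names are hypothetical and no such site is known to exist.

```python
import html
import http.client


def resolve_tco(path, timeout=10):
    """Ask t.co where a short path points, without following the redirect.

    Performs a network request; returns the Location header or None.
    """
    conn = http.client.HTTPSConnection("t.co", timeout=timeout)
    try:
        conn.request("HEAD", "/" + path.lstrip("/"))
        return conn.getresponse().getheader("Location")
    finally:
        conn.close()


def mapping_response(path, target):
    """Build a (status, headers, body) triple recording the t.co mapping.

    The Location header keeps the redirect working, and the HTML body makes
    the mapping visible in an archived copy of the response.
    """
    short = "https://t.co/" + path.lstrip("/")
    body = (
        "<!doctype html>\n"
        "<title>{s}</title>\n"
        '<p>{s} redirects to <a href="{t}">{t}</a></p>\n'
    ).format(s=html.escape(short), t=html.escape(target))
    headers = {"Location": target, "Content-Type": "text/html; charset=utf-8"}
    return 301, headers, body
```

An archiver following the convention would request this site's URL for a given path and store whatever `mapping_response(path, resolve_tco(path))` produced.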