| HN Mirror

	by thisisparker 3256 days ago
	ArchiveTeam and URLTeam do great work, but those projects and the OP have different goals. (I'm the OP.) This bot will NOT produce any kind of corpus of tweeted links or links shortened by Twitter's t.co shortener; in fact, it bypasses t.co entirely and I don't keep any record of those shortened links. Instead, it backs up the contents of the linked pages as they were at the time of the tweets. Frankly, my tool doesn't do anything interesting at all with the URLs; it just submits the "expanded URL" that was tweeted to the Wayback Machine and lets it sort out any and all 301s.
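
For readers who want to picture the submission step, here is a minimal sketch, not the bot's actual code. It assumes the tweet is a dict shaped like Twitter's v1.1 tweet JSON (links under `entities.urls[].expanded_url`) and that the Wayback Machine's Save Page Now endpoint is reached by prefixing the target with `https://web.archive.org/save/`. The function names are hypothetical, and the sketch only builds the request URLs; it sends nothing.

```python
from typing import List

# Assumed Save Page Now prefix; the target URL is appended as-is.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def expanded_urls(tweet: dict) -> List[str]:
    """Return the expanded URLs from a tweet dict, ignoring the t.co short links."""
    urls = tweet.get("entities", {}).get("urls", [])
    return [u["expanded_url"] for u in urls if u.get("expanded_url")]


def save_requests(tweet: dict) -> List[str]:
    """Build one Save Page Now URL per expanded link (hypothetical helper)."""
    return [SAVE_ENDPOINT + url for url in expanded_urls(tweet)]
```

Any redirects on the expanded URL are left for the Wayback Machine to follow, which matches the "lets it sort out any and all 301s" behaviour described above.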

2 comments

gojomo 3256 days ago

It might be nice to save as well:

(1) the tweet-detail page, for the tweet that includes the link;

(2) the t.co mapping, so that the tweet-detail page's t.co link can somehow be resolved to the (archived) page to which it links.

I don't think there are any blocks against doing (1).

Unfortunately for (2), Twitter has a blanket robots.txt prohibition in place for domain t.co. Perhaps IA could be convinced to ignore that robots.txt in the public interest.
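
To illustrate what a blanket prohibition means for a crawler that honours robots.txt, here is a small sketch using Python's standard `urllib.robotparser`. The rule text is an assumed example of a blanket rule, not a verified copy of t.co's live file, and the `ia_archiver` user-agent string is only used for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: a blanket rule of the kind described above,
# not a verified copy of t.co's live robots.txt.
BLANKET_ROBOTS = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether a robots.txt-respecting crawler may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under a rule like this, `is_allowed(BLANKET_ROBOTS, "ia_archiver", "https://t.co/abc123")` is false for every path on the domain.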

Alternatively, perhaps another site could be set up that itself accepts t.co link-paths, in the background queries t.co, and returns both an HTML page and working redirect that isn't robots.txt-blocked. LinkArchiver (and any other similar sites) could as a convention archive responses of this other site whenever they'd like to archive t.co.
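
A rough sketch of what such an intermediary site might return, under stated assumptions: `mirror_response` is a hypothetical name, the `resolve` callable stands in for the background query to t.co (no network code is shown), and the choice of a 301 status with an HTML body is one possible reading of "both an HTML page and working redirect".

```python
from html import escape
from typing import Callable, Dict, Tuple


def mirror_response(
    path: str, resolve: Callable[[str], str]
) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for a t.co link-path such as "abc123".

    `resolve` is an injected stand-in for querying t.co: it takes the short
    URL and returns the target that t.co redirects to.
    """
    short_url = "https://t.co/" + path.lstrip("/")
    target = resolve(short_url)
    # The HTML body records the mapping so an archived copy stays readable.
    body = (
        "<!doctype html><title>t.co mapping</title>"
        f"<p>{escape(short_url)} redirects to "
        f'<a href="{escape(target)}">{escape(target)}</a></p>'
    )
    # The Location header keeps the response usable as a working redirect.
    headers = {"Location": target, "Content-Type": "text/html; charset=utf-8"}
    return 301, headers, body
```

An archiver could then save this site's response for a given link-path in place of the robots.txt-blocked t.co URL.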

cyphar 3256 days ago

Right, but my point was that resolving the 301 and storing the result is more efficient overall than doing a full mirror of a URL only to find out that it's an alternative URL for the same page with their tracking stuff appended to the end.

But yeah, if you want to bypass t.co (meaning that it's not useful for archeologists) it makes sense to just archive the expanded URLs.
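
As an illustration of the tracking-parameter point above, here is a minimal sketch that normalises a resolved URL before archiving so that variants of the same page collapse to one address. Which parameters count as tracking is an assumption here (only the common `utm_` prefix is dropped), and `strip_tracking` is a hypothetical helper, not part of any tool mentioned in the thread.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list; which parameters count as tracking is a judgment call.
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url: str) -> str:
    """Return `url` with query parameters matching TRACKING_PREFIXES removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this, `https://example.com/post?id=7&utm_source=twitter` and `https://example.com/post?id=7` map to the same URL, so only one copy would need a full mirror.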