by horseradish7k 311 days ago
you'd have to rescrape them all from https://web.archive.org/cdx/search?url=goo.gl/* - they don't publish the whole datasets
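For anyone who wants to try the rescrape route, here is a rough sketch of building a prefix query against the Wayback CDX index. The endpoint path `/cdx/search/cdx` and the `matchType`, `output`, `fl` and `limit` parameters are my understanding of the public CDX server API, not something taken from the link above. `build_cdx_query` is a made-up helper name. The code only builds the URL and does not fetch anything.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Wayback CDX server; check it against the
# current API documentation before relying on it.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(prefix: str, limit: int = 10) -> str:
    """Return a CDX query URL listing captures whose URL starts with `prefix`."""
    params = {
        "url": prefix,
        "matchType": "prefix",  # match every capture under the prefix
        "output": "json",       # ask for JSON rows instead of plain text
        "fl": "original,timestamp,statuscode",  # fields to return
        "limit": limit,         # keep the response small while testing
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cdx_query("goo.gl/", limit=5))
```

A full rescrape would need paging through the results and being polite about request rates, which this sketch leaves out.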

No, I meant the .warc.zst files on archive.org that were the result of the ArchiveTeam's work. However, it seems they're under some kind of embargo, which is the first time I've ever seen a private link on archive.org
I can see some reasonable arguments for not publishing the full dataset. People undoubtedly shortened lots of links to unlisted videos/documents/pages under the assumption that the short link, like the original link, would be unguessable.
Then why go to the trouble of archiving them and uploading them to a public archive site, only to keep them secret?

I'm sure pastebin is filled with people's AWS credentials, too, but you don't see them randomly denying access to listings

Because then you can access the archived destination if you already know the short URL. You just can't get a full list of potentially sensitive short URL/destination pairs.
You are aware of which thread you're discussing this in, right? The one where a bunch of like-minded souls enumerated all the address space in a few weeks?

The sibling link above that queries Wayback's warc index shows at least the first several are only 6 alphanumeric characters wide, so it's no wonder the ArchiveTeam got them in reasonable time
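To put a number on "6 alphanumeric characters", here is a back-of-the-envelope sketch. It assumes the codes are case-sensitive and drawn from a-z, A-Z and 0-9 (62 symbols), which is my assumption and not something confirmed in this thread. The request rate is a made-up figure for illustration, not ArchiveTeam's actual throughput.

```python
import string

# Assumed alphabet: 26 lowercase + 26 uppercase + 10 digits = 62 symbols.
ALPHABET = string.ascii_letters + string.digits


def keyspace(length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return alphabet_size ** length


def days_to_enumerate(length: int, requests_per_second: float) -> float:
    """Days needed to try every code of `length` at a steady request rate."""
    return keyspace(length) / requests_per_second / 86_400


if __name__ == "__main__":
    print(keyspace(6))  # 56800235584 possible 6-character codes
    # Hypothetical aggregate rate across many volunteers.
    print(round(days_to_enumerate(6, 10_000), 1))
```

Under those assumptions the 6-character space is about 57 billion codes, which a distributed crawl could plausibly cover in weeks to months depending on the real rate.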

Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr which, yes, is for sure very, very secret

i asked them why they did this. the answer, surprisingly, is that they fear they will get blocked if they release the full dumps, because of the AI scraping wars.
Yeah what they did is probably the best way to handle it.
Tangentially related, but I've seen twitter links that used to be on the wayback machine disappear from it at some point, presumably due to a personal request from the owner.
Pretty sure you can nuke all your domain's old content by blocking archive.org in robots.txt
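For reference, a sketch of what such a rule might look like and how to check it locally with Python's standard `urllib.robotparser`. The `ia_archiver` user-agent token is an assumption on my part (it is the name historically associated with the Internet Archive's crawler), and whether the Wayback Machine still honors it, retroactively or at all, is not something this check can tell you. `is_blocked` and `example.com` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows everything for one crawler.
# "ia_archiver" is an assumed user-agent token, see the note above.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` forbids `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/old-page"
    print(is_blocked(ROBOTS_TXT, "ia_archiver", page))
    print(is_blocked(ROBOTS_TXT, "Googlebot", page))
```

This only shows how a standards-following parser reads the file. What a given archive actually does with it is up to that archive's policy.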