

	by horseradish7k 311 days ago
	you'd have to rescrape them all from https://web.archive.org/cdx/search?url=goo.gl/* - they don't publish the whole datasets
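A minimal sketch of what that rescrape query could look like, assuming the Wayback CDX server's `/cdx/search/cdx` endpoint and its `matchType`, `output`, `fl`, and `limit` parameters (the comment above uses a shorter `cdx/search?url=goo.gl/*` form). The helper name `build_cdx_query` and the field selection are illustrative, and the code only builds the URL rather than fetching anything.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(prefix: str, limit: int = 10) -> str:
    """Build a CDX query URL listing captures whose URL starts with `prefix`."""
    params = {
        "url": prefix,
        "matchType": "prefix",  # match every URL under the prefix
        "output": "json",       # ask for JSON rows instead of plain text
        "fl": "original,timestamp,statuscode",  # illustrative field selection
        "limit": limit,         # keep the sample small
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cdx_query("goo.gl/"))
```

Paging through the full result set this way is the "rescrape" the comment refers to: it yields capture records one query at a time, not a bulk download of the dataset.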


mdaniel 311 days ago

No, I meant the .warc.zst files on archive.org that were the result of the ArchiveTeam's work. However, it seems they're under some kind of embargo - which is the first time I've ever seen a private link on archive.org


rafram 311 days ago

I can see some reasonable arguments for not publishing the full dataset. People undoubtedly shortened lots of links to unlisted videos/documents/pages under the assumption that the short link, like the original link, would be unguessable.


mdaniel 311 days ago

Then why go to the trouble of archiving them, then upload them to a public archive site, only to then keep them secret?

I'm sure pastebin is filled with people's AWS credentials, too, but you don't see them randomly denying access to listings


rafram 311 days ago

Because then you can access the archived destination if you already know the short URL. You just can't get a full list of potentially sensitive short URL/destination pairs.


mdaniel 311 days ago

You are aware of which thread you're discussing this in, right? The one where a bunch of like-minded souls enumerated all the address space in a few weeks?

The sibling link above that queries Wayback's warc index shows at least the first several are only 6 alphanumeric characters wide, so it's no wonder the ArchiveTeam got them in reasonable time

Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr which, yes, is for sure very, very secret
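For a sense of scale on the 6-character alphanumeric address space mentioned above, here is a back-of-the-envelope sketch. It assumes a case-sensitive alphabet of 62 characters (a-z, A-Z, 0-9), and the request rate is a made-up figure for illustration, not anything reported about the ArchiveTeam's actual throughput.

```python
import string

# Assumption: short codes draw from case-sensitive letters plus digits.
ALPHABET = string.ascii_letters + string.digits  # 62 characters


def keyspace(length: int, alphabet_size: int = len(ALPHABET)) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return alphabet_size ** length


def days_to_enumerate(total: int, requests_per_second: float) -> float:
    """Days needed to try `total` codes at a steady aggregate request rate."""
    return total / requests_per_second / 86_400


if __name__ == "__main__":
    total = keyspace(6)
    print(total)  # 56800235584
    # Hypothetical aggregate rate across many volunteers.
    print(days_to_enumerate(total, requests_per_second=50_000))
```

Roughly 57 billion codes is large for one machine but within reach of a distributed effort, which is the point being made about unguessability.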


brokensegue 311 days ago

I asked them why they did this. The answer, surprisingly, is that they fear releasing the full dumps would get them blocked because of the AI scraping wars.


yreg 310 days ago

Yeah what they did is probably the best way to handle it.


viliml 311 days ago

Tangentially related, but I've seen Twitter links that used to be on the Wayback Machine disappear from it at some point, presumably due to a personal request from the owner.


corobo 311 days ago

Pretty sure you can nuke all your domain's old content by blocking archive.org in robots.txt
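A small sketch of what such a robots.txt rule might look like and how to check it locally with Python's standard `urllib.robotparser`. The `ia_archiver` user-agent name is an assumption (it is the name historically associated with the Internet Archive's crawler), and whether the Wayback Machine currently honors such a rule retroactively is the commenter's claim, not something this code verifies.

```python
from urllib.robotparser import RobotFileParser

# Assumption: "ia_archiver" is the user-agent the archive's crawler matches on.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` forbids `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/old-page"
    print(is_blocked(ROBOTS_TXT, "ia_archiver", page))   # True
    print(is_blocked(ROBOTS_TXT, "SomeOtherBot", page))  # False
```

The check only shows how a compliant crawler would interpret the file; it says nothing about what any particular archive does with content it has already captured.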
