| HN Mirror



	by duskwuff 3351 days ago
	There's some value in allowing site operators to retroactively remove content which was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subsequently being indexed and archived when a crawler finds the appropriate directory index. What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.

2 comments

Spare_account 3351 days ago

Here's a slight modification to the GP proposal:

- Respect robots.txt at the time you crawl it.

- If robots.txt appears later, stop archiving from that date forwards.

- Preserve access to old archived copies of the site by default.

- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic today retrospectively. Perhaps after a couple of warning emails to the current Administrative Contact for the domain.
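The four rules above can be sketched roughly as follows. This is only an illustration, not how archive.org works: `SiteState`, `Snapshot`, `may_crawl` and `may_serve` are hypothetical names, and it assumes the archive has recorded the date it first saw a blocking robots.txt.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Snapshot:
    url: str
    captured_on: date


@dataclass
class SiteState:
    # Hypothetical fields, assumed to be tracked per site by the archive.
    first_robots_block: Optional[date] = None  # date a blocking robots.txt was first observed
    owner_requested_removal: bool = False      # set only after a verified owner request


def may_crawl(state: SiteState, today: date) -> bool:
    """Rules 1 and 2: stop archiving from the date a blocking robots.txt appears."""
    return state.first_robots_block is None or today < state.first_robots_block


def may_serve(state: SiteState, snap: Snapshot) -> bool:
    """Rules 3 and 4: older copies stay visible unless the owner explicitly asks."""
    if state.owner_requested_removal:
        return False
    return state.first_robots_block is None or snap.captured_on < state.first_robots_block
```

The point of splitting crawling from serving is that a robots.txt appearing later only affects the first function; the second changes only on an explicit removal request.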


pbhjpbhj 3351 days ago

>mechanism that allows a proven site owner to explicitly request retrospective access removal. //

It should be "a proven content owner"; just buying a domain shouldn't allow someone to remove its old content from the archive.


ss64 3351 days ago

How about you respect the robots.txt until the IP address where it is hosted changes? Once the IP has changed, any new robots.txt exclusions apply only to the new pages, not to the pages archived under the old IP, which continue respecting the old archived robots.txt.

The IP address changing is a pretty solid indicator that control of that content has moved to a new organisation. Note this does not always coincide with the domain name owner changing.
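A minimal sketch of that IP-based rule, assuming the crawler stored the hosting IP alongside every robots.txt it fetched. `RobotsRecord`, `governing_record` and `may_serve_snapshot` are hypothetical names, and `disallow_all` stands in for real robots.txt parsing.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class RobotsRecord:
    observed_on: date
    host_ip: str
    disallow_all: bool  # simplification: the file either blocks everything or nothing


def governing_record(history: List[RobotsRecord], captured_on: date) -> Optional[RobotsRecord]:
    """Return the robots.txt that governs a snapshot taken on captured_on."""
    records = sorted(history, key=lambda r: r.observed_on)
    idx = None
    for i, rec in enumerate(records):
        if rec.observed_on <= captured_on:
            idx = i  # latest record at or before the capture date
    if idx is None:
        return None
    # Later robots.txt versions still apply while the hosting IP is unchanged.
    while idx + 1 < len(records) and records[idx + 1].host_ip == records[idx].host_ip:
        idx += 1
    return records[idx]


def may_serve_snapshot(history: List[RobotsRecord], captured_on: date) -> bool:
    rec = governing_record(history, captured_on)
    return rec is None or not rec.disallow_all
```

Under this sketch, a restrictive robots.txt on a parking page at a new IP does not hide older snapshots, and a permissive one from a new owner does not expose pages the previous host had blocked.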

A scenario that I can imagine becoming litigious: company owns a domain for promoting some product and they use robots.txt to prevent copies. The product reaches end of life and domain is allowed to expire. Someone else buys the domain and starts hosting content with no robots restriction. Archive.org start to display pages from the old company. Company then sues archive.org for copyright violation.
