by duskwuff
3351 days ago
There's some value in allowing site operators to retroactively remove content that was never intended to be public. A common and unfortunate example is backups (like SQL dumps) being stored in web-accessible directories, then subsequently being indexed and archived when a crawler finds the relevant directory index. What needs to be fixed first is just the really common case mentioned in the blog post, where a domain changes ownership and a restrictive robots.txt is applied to the parking page.
- Respect robots.txt at the time you crawl it.
- If robots.txt appears later, stop archiving from that date forwards.
- Preserve access to old archived copies of the site by default.
- Offer a mechanism that allows a proven site owner to explicitly request retrospective access removal.

If archive.org have recorded the date that they first observed a robots.txt on the sites currently unavailable, they could even consider applying the above logic retrospectively today, perhaps after a couple of warning emails to the current Administrative Contact for the domain.
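
To make the four rules above concrete, here is a minimal sketch in Python. It is not how archive.org actually works; `SitePolicy`, `may_crawl`, and `may_serve` are hypothetical names, and it assumes the archive stores a single "first observed a restrictive robots.txt" date per site plus a flag set only after a verified owner request.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SitePolicy:
    """Hypothetical per-site record; field names are illustrative only."""
    # Date a restrictive robots.txt was first observed, or None if never seen.
    robots_first_seen: Optional[date] = None
    # Set only after a proven site owner explicitly asks for old copies to be hidden.
    owner_requested_removal: bool = False


def may_crawl(policy: SitePolicy, crawl_date: date) -> bool:
    """Rules 1 and 2: respect robots.txt at crawl time, stop archiving once it appears."""
    return policy.robots_first_seen is None or crawl_date < policy.robots_first_seen


def may_serve(policy: SitePolicy, snapshot_date: date) -> bool:
    """Rules 3 and 4: older snapshots stay visible unless the owner opted out."""
    if policy.owner_requested_removal:
        return False
    # Snapshots taken before robots.txt appeared remain accessible by default.
    return policy.robots_first_seen is None or snapshot_date < policy.robots_first_seen


# A domain that changed hands and gained a restrictive robots.txt on 2015-06-01.
parked = SitePolicy(robots_first_seen=date(2015, 6, 1))
print(may_serve(parked, date(2012, 3, 14)))  # snapshot predates the robots.txt
print(may_crawl(parked, date(2016, 1, 1)))   # crawl attempt after it appeared
```

The point of splitting crawling from serving is that a robots.txt added by a later owner only affects what is collected from that date on, while hiding earlier snapshots requires the separate, explicit owner request.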