|
|
|
|
|
by dingaling
3351 days ago
|
|
It is great news in general, but seems to be done in a clumsy and counterproductive manner that may cause the Internet Archive to be banned from crawling some websites. The problem: when robots.txt for a website is found to have been made more restrictive, the IA retrospectively applies its new restrictions to already-archived pages and hides them from view. This can also cause entire domains to vanish into the deep-archive. No-one outside IA thinks this is sensible. Their solution: ignore robots.txt altogether. What? That will just annoy many website operators. My proposed solution: keep parsing robots.txt on each crawl and obey it progressively, without applying the changes to existing archived material. This is actually less work than what they currently do. If the new robots.txt says to ignore about_iphone.html you just do that and ignore it. Older versions aren't affected. Basically they're switching from being excessively obedient to completely ignoring robots.txt in order to fix a self-made problem. I can only see that antagonising operators. |
|