| HN Mirror



	by mrkramer 817 days ago
	Internet Archive's crawler does not respect robots.txt because they want to archive everything, not just parts of the Web. But if you actively ignore robots.txt, your crawler will get a bad reputation and you will have an army of webmasters trying to block it by any means. You can see crawling requests in your server logs; that's how you know whether a crawler is respecting it or not. Imo, the best solution would be to license your content so crawlers pay a fee for crawling and using it.
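For anyone who wants to run that log check, here is a rough sketch using Python's standard `urllib.robotparser`. It assumes your server writes Combined Log Format, and the robots.txt rules, sample log lines, and the `archive.org_bot` user-agent string are made up for illustration. `find_violations` is a hypothetical helper name, not part of any library.

```python
import re
from urllib.robotparser import RobotFileParser

# Pulls the request path and user agent out of a Combined Log Format line.
# The format is an assumption; adjust the pattern for your own server.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def find_violations(robots_txt, log_lines):
    """Return (user_agent, path) pairs for requests that robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        agent, path = match["agent"], match["path"]
        if not parser.can_fetch(agent, path):
            violations.append((agent, path))
    return violations

# Illustrative data only, not real rules or real traffic.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /private/page HTTP/1.1" 200 2326 "-" "archive.org_bot"',
    '203.0.113.5 - - [10/Oct/2023:13:55:40 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "archive.org_bot"',
]

if __name__ == "__main__":
    for agent, path in find_violations(ROBOTS_TXT, SAMPLE_LOG):
        print(f"{agent} fetched disallowed path {path}")
```

This only tells you what a client claimed in its User-Agent header, which is trivial to spoof, so treat a hit as a hint rather than proof of who was crawling.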


nerdjon 817 days ago

Well TIL that IA does not respect robots.txt.

Does IA itself block crawlers? It doesn't look like it according to their robots.txt, which even goes so far as to say "Please crawl our files."

What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?


mrkramer 817 days ago

>Well TIL that IA does not respect robots.txt.

At least, that's what they say[0].

>What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?

Nothing; as far as I understand, scraping the public web is legal, or at least that's what courts have been saying lately. Btw, it's mind-boggling to me that after 30 years of the commercial Internet and Web, we still don't have a definite answer on whether scraping public websites and public web content is legal or illegal.

[0] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


nerdjon 817 days ago

> Nothing; as far as I understand, scraping the public web is legal, or at least that's what courts have been saying lately. Btw, it's mind-boggling to me that after 30 years of the commercial Internet and Web, we still don't have a definite answer on whether scraping public websites and public web content is legal or illegal.

I was thinking more about public perception than the legal side, but the legal side would be a good question too.

Something like, "Yeah, I totally respected your robots.txt file; the only reason I have your data is that I crawled IA. See, they are the ones you should be mad at, not us."
