| HN Mirror



	by amitamb 2616 days ago
	Apart from that, Common Crawl respects robots.txt (which makes sense), so many sites you'd expect to see there are not indexed: Netflix, Facebook, LinkedIn, and many more. If Common Crawl sees serious adoption, those sites will modify their robots.txt, but it's a chicken-and-egg problem.
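
	For readers unfamiliar with what "respects robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the example.com URL are made up for illustration; they are not taken from any of the sites named above, and this is not Common Crawl's actual code. `CCBot` is the user-agent name Common Crawl's crawler uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Common Crawl's crawler (CCBot)
# while leaving the site open to every other user agent.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page"
    print("CCBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "CCBot", page))
    print("SomeOtherBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "SomeOtherBot", page))
```

	A polite crawler runs a check like this before each fetch and skips any URL the file disallows for its user agent, which is how a site ends up absent from the index.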


amelius 2616 days ago

There is a simple solution: if companies do not respect do-not-track then why should we respect robots.txt?


nostrademons 2615 days ago

Because then you end up in an arms race that the little guy usually does not win.

There are a significant number of crawlers out there that don't respect robots.txt. The usual response to them isn't to roll over dead, it's to get CloudFlare (on the technological end) and/or sic the lawyers on them (for CFAA, IP, or ToS violations).
