by amitamb 2616 days ago
Apart from that, Common Crawl respects robots.txt (which makes sense), so many sites you would expect to see there are not indexed: Netflix, Facebook, LinkedIn and many more. If Common Crawl sees serious adoption, those sites will modify their robots.txt, but it's a chicken-and-egg problem.
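
For anyone unsure what "respects robots.txt" means in practice, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, the `OtherBot` name and the `allowed` helper are all made up for illustration; `CCBot` is used as the Common Crawl user agent. This is not Common Crawl's actual code, just the general shape of the check a polite crawler would do before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "OtherBot"):
        for url in ("https://example.com/page", "https://example.com/private/x"):
            print(agent, url, allowed(EXAMPLE_ROBOTS_TXT, agent, url))
```

With a file like this, a crawler identifying as CCBot is shut out of the whole site while other bots are only kept out of `/private/`, which is why a site can be absent from the crawl even though it is publicly reachable.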

There is a simple solution: if companies do not respect do-not-track then why should we respect robots.txt?
Because then you end up in an arms race that the little guy usually does not win.

There are a significant number of crawlers out there that don't respect robots.txt. The usual response to them isn't to roll over dead, it's to get CloudFlare (on the technological end) and/or sic the lawyers on them (for CFAA, IP, or ToS violations).