|
I've built crawlers that retrieve billions of web pages every month. We had a whole team working on modifying the crawlers to keep up with website changes, reverse engineer AJAX requests, and solve hard problems like CAPTCHAs. Bottom line: if someone wants to crawl your website, they will. What you can do, however, is make it hard enough that the vast majority of developers can't do it (e.g. my tech crawled billions of pages, but it took a whole team dedicated to keeping it going). If you have money to spend, Distil Networks and Incapsula have good solutions. They block PhantomJS and browsers driven by Selenium, and they rate limit bots. One thing I've found really effective that some websites do is tarpitting bots. That is, they slowly increase the number of seconds it takes to respond to each HTTP request, so after a certain number of requests to your site it takes 30+ seconds for the bot to get the HTML back. The downside is that your web servers need to hold many more open connections at once, but the benefit is that you throttle the bots to an acceptable level. I currently run a website that gets crawled a lot, deadheat.ca. I've written a simple algorithm that tarpits bots. I also throw a CAPTCHA every now and then when I see an IP address hitting the site too often over a span of a few minutes. The website is not super popular and, in my case, it's pretty simple to differentiate between a human and a bot. Hope this helps... |
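To make the tarpit idea concrete, here is a minimal sketch in Python. It is not the code running on deadheat.ca, just one way the approach could look. The class name `Tarpit`, the parameter names, and all the thresholds (5 minute window, 50 free requests, 0.5 seconds added per extra request, 30 second cap, CAPTCHA after 200 hits) are assumptions I picked for illustration.

```python
import time
from collections import defaultdict, deque


class Tarpit:
    """Hypothetical per-IP sliding-window counter that maps request
    volume to a response delay. All defaults are illustrative."""

    def __init__(self, window=300.0, free_requests=50, step=0.5,
                 max_delay=30.0, captcha_after=200):
        self.window = window                # seconds of history kept per IP
        self.free_requests = free_requests  # requests allowed with no delay
        self.step = step                    # seconds added per extra request
        self.max_delay = max_delay          # cap on the delay
        self.captcha_after = captcha_after  # hits in window before a CAPTCHA
        self.hits = defaultdict(deque)      # ip -> timestamps of recent hits

    def record(self, ip, now=None):
        """Record one request and return the hit count inside the window."""
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q)

    def delay_for(self, count):
        """Delay grows linearly past the free allowance, up to max_delay."""
        excess = max(0, count - self.free_requests)
        return min(self.max_delay, excess * self.step)

    def check(self, ip, now=None):
        """Return (seconds_to_wait, show_captcha) for this request."""
        count = self.record(ip, now)
        return self.delay_for(count), count > self.captcha_after
```

In a request handler you would call `check(ip)` and wait that many seconds before sending the response. Use a non-blocking sleep if your server is asynchronous, otherwise the tarpit ties up your own worker threads, which is the connection cost mentioned above. This sketch also never forgets idle IPs, so a real deployment would need to purge old entries.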