by loceng 4496 days ago
I think a better business model would be creating a service that identifies scrapers, and then blocks them. I think one might already exist, though I can't remember its name.

I don't think either is a really great idea. I think most of the people who would pay to block web scrapers are either being paranoid or are being scraped by people smart and resourceful enough to get around your filters. Any serious web scraper is going to be scripting a real browser engine, so it's going to act just like a real visitor.
There are ways to detect scrapers and other bots if you really want to, and there are services that do so.
Adding a captcha to every page, maybe? There are services that will charge you money for this, but that doesn't mean it works.
Captchas don't do anything to stop bots; they just add a small additional cost (~$1.40 per 1,000 solved). I am talking about monitoring things that 90% of bots generally do not take precautions against, like tracking mouse movements and other things I won't mention here, that distinguish them from humans.
I don't believe you. At best you can obfuscate and confuse scrapers. You can't stop them from reading a public web page. (And I shudder to think what these solutions must do to accessibility -- hope you don't have any blind readers.)
Oh, I wasn't saying you can completely stop them from reading a page or individual pages. But there is activity that can be detected as irregular. Here is a true example I know of: someone wanted to scrape their competitor's client listings. The competitor had a map with points for their customers, each with a random user ID, and nowhere was the entire dataset visible. The person just built a scraper/bot to hit every single possible ID across a range of over 10,000 numbers. They hit a ton of empty pages, and that company should have recognized an IP incrementally crawling their data, especially the empty pages. This activity should have been recognized and resulted in an IP ban.
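To make the ID-crawling example concrete, here is a minimal sketch of how a site could flag that pattern. Everything in it is an assumption for illustration: the `EnumerationDetector` class, its thresholds, and the idea of calling `record()` from a request handler are hypothetical, not how any of the services in this thread actually work.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClientStats:
    total: int = 0                 # requests seen from this IP
    misses: int = 0                # requests that hit an empty/nonexistent ID
    sequential: int = 0            # requests whose ID was exactly previous ID + 1
    last_id: Optional[int] = None  # most recent ID requested by this IP


class EnumerationDetector:
    """Hypothetical per-IP tracker; all thresholds are illustrative guesses."""

    def __init__(self, min_requests=100, max_miss_ratio=0.5,
                 max_sequential_ratio=0.8):
        self.min_requests = min_requests
        self.max_miss_ratio = max_miss_ratio
        self.max_sequential_ratio = max_sequential_ratio
        self.stats = defaultdict(ClientStats)

    def record(self, ip, record_id, found):
        """Log one lookup: which ID was requested and whether it existed."""
        s = self.stats[ip]
        s.total += 1
        if not found:
            s.misses += 1
        if s.last_id is not None and record_id == s.last_id + 1:
            s.sequential += 1
        s.last_id = record_id

    def is_suspicious(self, ip):
        """True once an IP has enough traffic and looks like it is enumerating."""
        s = self.stats.get(ip)
        if s is None or s.total < self.min_requests:
            return False  # too little traffic to judge
        miss_ratio = s.misses / s.total
        sequential_ratio = s.sequential / s.total
        return (miss_ratio > self.max_miss_ratio
                or sequential_ratio > self.max_sequential_ratio)
```

A bot walking IDs 1, 2, 3, ... trips both signals (mostly empty pages, mostly sequential), while a normal visitor clicking a handful of real map points never reaches the request minimum. A scraper that randomizes its order and spreads requests over many IPs would still slip past this, which is the arms race mentioned elsewhere in the thread.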
Do you know the names of any such services off the top of your head? Thanks
If you want to learn about the concept and the arms race, here is a paper with plenty of resources (it is in a game context rather than a website one, but that is where the most advanced detection is): http://iseclab.org/papers/botdetection-article.pdf

Here is an open source system demo'd at BlackHat Europe 2011 that checks the client is a proper browser (with DOM/JavaScript/etc.); it is also good against DDoS: https://github.com/yuri-gushin/Roboo

Project Honeypot (scans inbound IPs), good against spambots: https://www.projecthoneypot.org/

Here are some commercial solutions: CloudFlare's ScrapeShield - https://www.cloudflare.com/apps/scrapeshield

Distil Networks - http://www.distilnetworks.com/

Scrape Sentry - http://www.scrapesentry.com/

Fireblade - http://www.fireblade.com/

Great. Thank you.