by centdev 4931 days ago
Are there known user agents that you can identify as coming from the scrapers? If so, you could either block on user agent and not worry about IP addresses, or disable Google Analytics on those requests so they don't skew the GA data.
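
A minimal sketch of the user-agent approach, in Python. The pattern list and the function names (`is_known_scraper`, `should_emit_analytics`) are hypothetical and only for illustration; a real deployment would need its own list, and treating a missing user agent as suspicious is an assumption, not something from this thread.

```python
import re

# Hypothetical substrings that self-identifying tools often put in their
# User-Agent header. This is an illustrative list, not an authoritative one.
KNOWN_SCRAPER_UA = re.compile(
    r"(bot|crawler|spider|scrapy|python-requests|curl|wget)",
    re.IGNORECASE,
)


def is_known_scraper(user_agent):
    """Return True if the User-Agent matches a known scraper pattern."""
    if not user_agent:
        # Assumption: a missing or empty User-Agent is treated as a scraper.
        return True
    return bool(KNOWN_SCRAPER_UA.search(user_agent))


def should_emit_analytics(user_agent):
    """Decide whether to include the analytics snippet for this request."""
    return not is_known_scraper(user_agent)
```

This only catches clients that announce themselves. A browser driven by automation sends an ordinary browser user agent and would pass this check.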

As was identified, most of the scraping occurs via Selenium. Selenium just automates a real browser, so it would look like 100% legit traffic with a legit user agent (it defaults to using Firefox).
So the solution is simple, isn't it? Google Analytics filter > filter referers like Selenium (or the entire EC2 block).
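
A rough sketch of the "entire block" idea, in Python, using the standard `ipaddress` module. The CIDR ranges below are placeholders from the reserved documentation address space, not real EC2 ranges, and `in_blocked_range` is a hypothetical helper name.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real EC2 blocks.
# Substitute whatever ranges you actually want to exclude.
PLACEHOLDER_BLOCKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def in_blocked_range(ip, blocks=PLACEHOLDER_BLOCKS):
    """Return True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocks)
```

For example, `in_blocked_range("203.0.113.7")` is true for the placeholder list, while `in_blocked_range("192.0.2.1")` is not.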
Not sure how you could filter Selenium. A well-written script looks like normal traffic.
A well-written scraper looks like normal traffic in the same way that a well-written pseudo-random number generator looks like a random number generator. It'll fool your eye but not statistical analysis.

Think about the goal of a scraper: it needs to actually walk through all the content. That doesn't look like a normal user at all. An individual request might look OK, but in aggregate the pattern of a robot pops out.

https://www.usenix.org/conference/usenixsecurity12/pubcrawl-...
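
To make the aggregate point concrete, here is a crude sketch in Python. It is not the method from the linked paper; it is just one simple heuristic, assuming you can attribute requests to a client and that you know roughly how many distinct pages the site has. The names `coverage_by_client` and `flag_suspects` and the 0.5 threshold are made up for illustration.

```python
from collections import defaultdict


def coverage_by_client(requests, catalog_size):
    """Map each client to the fraction of the site's pages it has visited.

    requests: iterable of (client_id, path) pairs from an access log.
    catalog_size: approximate number of distinct pages on the site.
    """
    seen = defaultdict(set)
    for client_id, path in requests:
        seen[client_id].add(path)
    return {client: len(paths) / catalog_size for client, paths in seen.items()}


def flag_suspects(requests, catalog_size, threshold=0.5):
    """Return clients that visited at least `threshold` of all pages."""
    coverage = coverage_by_client(requests, catalog_size)
    return sorted(c for c, frac in coverage.items() if frac >= threshold)
```

Each request here is unremarkable on its own; only the per-client total stands out. The heuristic depends entirely on `client_id` being a reliable way to group requests.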

So, would that detection mechanism be able to deal with a number of coordinated scrapers rotating through lists of proxies, using different User-Agent strings, and inserting (pseudo-)random delays between requests?
Yup, read the paper.
Not all scrapers spider a site's contents. Sometimes they are going after something specific.