HN Mirror



	by centdev 4978 days ago
	Are there known user agents that you can identify as coming from the scrapers? If so, you could either block on that basis and not worry about IP addresses, or disable Google Analytics on those requests so they don't skew your GA data.
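A minimal sketch of the user-agent approach suggested here, assuming the request's User-Agent header is available as a string. The pattern list and the names `is_known_scraper` and `handle_request` are hypothetical and for illustration only. It would only catch tools that announce themselves, not a scraper driving a real browser.

```python
import re

# Hypothetical patterns for self-identifying tools; not taken from the thread.
SCRAPER_UA_PATTERNS = [r"python-requests", r"curl/", r"scrapy", r"wget/"]
_SCRAPER_RE = re.compile("|".join(SCRAPER_UA_PATTERNS), re.IGNORECASE)


def is_known_scraper(user_agent):
    """Return True if the User-Agent string matches one of the listed patterns."""
    return bool(user_agent) and _SCRAPER_RE.search(user_agent) is not None


def handle_request(user_agent, block=True):
    """Return 'block', 'serve_without_analytics', or 'serve' for a request."""
    if is_known_scraper(user_agent):
        # Either refuse the request, or serve it without the analytics snippet
        # so it does not skew the reported traffic.
        return "block" if block else "serve_without_analytics"
    return "serve"
```

Calling `handle_request(ua, block=False)` corresponds to the second option: the page is still served, but the analytics tag would be left out for matching requests.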


bdcravens 4978 days ago

As was identified, most of the scraping occurs via Selenium. Selenium just automates a real browser, so it'd look like 100% legit traffic with a legit user agent (it defaults to using Firefox).


chewxy 4978 days ago

So the solution is simple, isn't it? Google Analytics filter > filter referers like Selenium (or the entire EC2 block).


bdcravens 4978 days ago

Not sure how you could filter Selenium. A well-written script looks like normal traffic.


seats 4978 days ago

A well-written scraper looks like normal traffic in the same way that a well-written pseudorandom number generator looks like a random number generator. It'll fool your eye but not statistical analysis.

Think about the goal of a scraper: it needs to actually walk through all the content. That doesn't look like a normal user at all. An individual request might look OK, but in aggregate the pattern of a robot pops out.

https://www.usenix.org/conference/usenixsecurity12/pubcrawl-...
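
To illustrate the "in aggregate" point, here is a rough sketch of one simple heuristic: flag clients that request an unusually large share of the site's distinct pages. This is not the method from the linked paper; the function name `flag_crawler_like_clients`, the `(client_id, path)` log format, and both thresholds are assumptions made for the example.

```python
from collections import defaultdict


def flag_crawler_like_clients(requests, min_requests=20, coverage_threshold=0.5):
    """Flag clients whose distinct-page coverage is crawler-like.

    requests: iterable of (client_id, path) tuples (assumed log format).
    Returns the set of client ids that made at least `min_requests` requests
    and touched at least `coverage_threshold` of all distinct paths seen.
    """
    pages_by_client = defaultdict(set)
    request_counts = defaultdict(int)
    all_pages = set()

    for client_id, path in requests:
        pages_by_client[client_id].add(path)
        request_counts[client_id] += 1
        all_pages.add(path)

    if not all_pages:
        return set()

    flagged = set()
    for client_id, pages in pages_by_client.items():
        coverage = len(pages) / len(all_pages)
        if request_counts[client_id] >= min_requests and coverage >= coverage_threshold:
            flagged.add(client_id)
    return flagged
```

A single request from such a client is indistinguishable from a human's; only the per-client totals differ. Keying on one client id is itself an assumption, and a real analysis would likely need more features than page coverage alone.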


brechin 4978 days ago

So, would that detection mechanism be able to deal with a number of coordinated scrapers rotating through lists of proxies, using different User-Agent strings, with (pseudo-)random delays between requests?


seats 4978 days ago

Yup, read the paper.


bdcravens 4978 days ago

Not all scrapers spider a site's contents. Sometimes they are going after something specific.
