by bdcravens 4931 days ago
Not sure how you could filter Selenium. A well-written script looks like normal traffic.

A well-written scraper looks like normal traffic in the same way that a well-written pseudorandom number generator looks like a random number generator. It'll fool your eye but not statistical analysis.

Think about the goal of a scraper: it needs to actually walk through all the content. That doesn't look like a normal user at all. An individual request might look OK, but in aggregate the pattern of a robot pops out.
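
To make the "in aggregate" point concrete, here is a rough sketch of one possible signal: what fraction of the site's distinct pages each client has touched. This is only an illustration, not the method from the linked paper. The function names, the (client, path) log format, and the 0.5 threshold are all made up for the example.

```python
from collections import defaultdict


def coverage_by_client(requests):
    """Map each client to the fraction of all distinct paths it requested.

    `requests` is an iterable of (client_id, path) pairs; this log format
    is an assumption made for the sketch.
    """
    paths_by_client = defaultdict(set)
    all_paths = set()
    for client, path in requests:
        paths_by_client[client].add(path)
        all_paths.add(path)
    total = len(all_paths)
    return {client: len(paths) / total for client, paths in paths_by_client.items()}


def flag_crawlers(requests, threshold=0.5):
    """Return clients whose coverage meets the (arbitrary) threshold."""
    coverage = coverage_by_client(requests)
    return sorted(client for client, frac in coverage.items() if frac >= threshold)


if __name__ == "__main__":
    # A human reads a couple of pages; a bot walks every item page.
    log = [("human", "/"), ("human", "/item/1")]
    log += [("bot", f"/item/{i}") for i in range(10)]
    print(flag_crawlers(log))
```

Each of the bot's requests is unremarkable on its own; only the per-client total stands out. A real detector would need more than this one signal, since coverage alone says nothing about clients that split the walk between them.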

https://www.usenix.org/conference/usenixsecurity12/pubcrawl-...

So, would that detection mechanism be able to deal with a number of coordinated scrapers rotating through lists of proxies, using different User-Agent strings, making requests with (pseudo-)random delays between requests?
Yup, read the paper.
Not all scrapers spider a site's contents. Sometimes they are going after something specific.