Hacker News new | ask | show | jobs
by nutjob2 2525 days ago
They can characterise the (browsing) behaviour of all their visitors, and then further characterise those who fall outside their "normal" thresholds. The outsiders that exhibit some sort of correlation (ie their characteristics are not independent of each other) are banned. Any quirks or patterns your systems have would be identifiable as "artificial", and even those that are randomised or seek to emulate humans will have features that are identifiable. An NDA is ineffective against machine learning.

The countermeasure would be to have a bunch of humans use the websites in any way they want, totally undirected, then use the totality of that browsing to facilitate your scraping probabilistically. It would be less efficient, but very difficult to catch.

1 comments

That's the general direction I'd like to take. When we capture the inputs for the scrapers, I'd like to persist everything. Mouse jiggles, delays, idle time. I think it would definitely help advance the software.
In the grand scheme of things all of this is a wasteful process. Maybe you could direct your worklife towards other challenges that are more rewarding for society and equally profitable?
I think that's unjustified and a little rude. OP is providing an automated service for publicly accessible data that isn't accessible for automation. If the sources are notified and they are operating within the confines of the law, this is no different than writing a search engine crawler.
That crosses into personal attack. Please don't do that on Hacker News. We've had to ask you this before.

https://news.ycombinator.com/newsguidelines.html

OP is being reasonably compensated for something that is perfectly legal.