by Pinbenterjamin 2525 days ago
According to the NDA with my company I can't reveal anything about the architecture beyond the fact that it is hosted locally on a homebuilt distributed system that randomly chooses from a pool of 120 residential IPs.

We do have human emulation routines that helped avoid most detection, and that library is decoupled in such a way that we can edit behavior down to the individual site.

Some sites are just so damn good at detecting us and I just don't get it.

3 comments

They can characterise the (browsing) behaviour of all their visitors, and then further characterise those who fall outside their "normal" thresholds. The outsiders that exhibit some sort of correlation (i.e. their characteristics are not independent of each other) are banned. Any quirks or patterns your systems have would be identifiable as "artificial", and even those that are randomised or seek to emulate humans will have features that are identifiable. An NDA is ineffective against machine learning.
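
To make the "outside normal thresholds" idea concrete, here is a minimal sketch of what a site might do with a single behavioural feature. It assumes the only feature is the mean delay between requests and uses a plain z-score cutoff; the function name, the threshold, and the choice of feature are all hypothetical, and a real system would presumably combine many features and look for correlation among the outliers.

```python
from statistics import mean, pstdev

def flag_outliers(delays_by_visitor, z_threshold=2.0):
    """Return ids of visitors whose mean inter-request delay is far from the crowd.

    delays_by_visitor maps a visitor id to a list of delays in seconds.
    The z-score cutoff is an illustrative assumption, not a known production value.
    """
    per_visitor = {v: mean(d) for v, d in delays_by_visitor.items()}
    values = list(per_visitor.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Everyone behaves identically, so nobody stands out.
        return set()
    return {v for v, m in per_visitor.items() if abs(m - mu) / sigma > z_threshold}

# Ten human-like visitors averaging about 5 s between requests, one fast client.
visitors = {f"h{i}": [4.0 + 0.2 * i, 6.0 - 0.2 * i] for i in range(10)}
visitors["bot"] = [0.1, 0.1]
flagged = flag_outliers(visitors)
```

Even this crude version separates the fast client from the rest, which is the point of the comment: anything that sits consistently away from the bulk of visitors is easy to pick out.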

The countermeasure would be to have a bunch of humans use the websites in any way they want, totally undirected, then use the totality of that browsing to facilitate your scraping probabilistically. It would be less efficient, but very difficult to catch.

That's the general direction I'd like to take. When we capture the inputs for the scrapers, I'd like to persist everything. Mouse jiggles, delays, idle time. I think it would definitely help advance the software.
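
As a rough sketch of what "persist everything" could look like, assuming a simple JSON-lines log: the `InputEvent` record, its field names, and the 2 second idle cutoff below are all made up for illustration and are not our actual format.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class InputEvent:
    kind: str   # hypothetical labels, e.g. "mouse_move", "click", "key"
    t: float    # seconds since the start of the captured session
    x: int = 0  # pointer position, where it applies
    y: int = 0

def to_jsonl(events):
    """Serialise events as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e)) for e in events)

def from_jsonl(text):
    """Rebuild events from the JSON-lines form written by to_jsonl."""
    return [InputEvent(**json.loads(line)) for line in text.splitlines() if line]

def idle_gaps(events, min_gap=2.0):
    """Return the gaps between consecutive events that are at least min_gap seconds."""
    ts = sorted(e.t for e in events)
    return [b - a for a, b in zip(ts, ts[1:]) if b - a >= min_gap]

session = [
    InputEvent("mouse_move", 0.0, 10, 20),
    InputEvent("click", 0.5, 10, 20),
    InputEvent("key", 4.0),
]
restored = from_jsonl(to_jsonl(session))
gaps = idle_gaps(restored)
```

Keeping the raw timestamps rather than precomputed delays means idle time and jitter can be re-derived later with different cutoffs.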
In the grand scheme of things all of this is a wasteful process. Maybe you could direct your worklife towards other challenges that are more rewarding for society and equally profitable?
I think that's unjustified and a little rude. OP is providing an automated service for publicly accessible data that isn't accessible for automation. If the sources are notified and they are operating within the confines of the law, this is no different than writing a search engine crawler.
That crosses into personal attack. Please don't do that on Hacker News. We've had to ask you this before.

https://news.ycombinator.com/newsguidelines.html

OP is being reasonably compensated for something that is perfectly legal.
A pool of 120 residential IPs is way too small: with so few addresses, patterns emerge much more readily. Go for thousands or, even better, hundreds of thousands. Outsource the residential proxy system to Luminati or Oxylabs.
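
A quick back-of-the-envelope sketch of why pool size matters, assuming requests are assigned to IPs uniformly at random as the OP describes. The request volume is invented for illustration; only the pool of 120 comes from the thread.

```python
import random
from collections import Counter

def per_ip_load(total_requests, pool_size, seed=0):
    """Return (average requests per IP, busiest IP's request count).

    Simulates picking an IP uniformly at random for every request.
    The seed only makes the illustration repeatable.
    """
    rng = random.Random(seed)
    hits = Counter(rng.randrange(pool_size) for _ in range(total_requests))
    return total_requests / pool_size, max(hits.values())

# Hypothetical volume of 12,000 requests against two pool sizes.
avg_small, peak_small = per_ip_load(12_000, 120)
avg_large, peak_large = per_ip_load(12_000, 12_000)
```

With the same volume, each of the 120 IPs carries about a hundred requests, while in the larger pool most IPs are seen once or not at all, so there is far less per-address history to profile.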
This sounds, at best, ethically dubious and at worst illegal. Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Given that you run this division, there is a good chance you are personally liable.

We have an enormous legal team that communicates constantly with end points to ensure they are aware of our scraping. And as I said in another comment, we store no results other than what is already available to anyone else using the web.

We've had this division for many, many years, and before my time we paid another company to do this. There are no legal issues.

Your legal team is in contact with them, but their security is actively trying to block you? That doesn't make sense.

Computer security laws are very broad. It doesn't matter if it's just a website that the public can access. If you're accessing it in a manner that they don't want AND you're aware of that, then I struggle to see how your lawyers can justify it.

> Computer hacking is broadly defined as intentionally accesses a computer without authorization or exceeds authorized access.

https://definitions.uslegal.com/c/computer-hacking/

Hiding your user agent because you know they don't want automated retrieval of information is "without authorisation".

>Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Don't think connecting a computer to a private network to suck up subscriber data is comparable to scraping publicly accessible internet content.

These fear mongering comments always ignore the notice provision in the CFAA. Web scraping publicly accessible information is not "illegal" under the CFAA. That law, at most, only makes someone who continues scraping after being asked to stop potentially culpable.

First, the accuser needs to, at least, send a cease and desist letter to the accused asking them to stop accessing the protected computer. Second, the accused needs to ignore that request and keep accessing the protected computer.

Is it possible to build a solid CFAA case when those two things do not happen? I cannot find any examples.

https://iapp.org/news/a/can-a-cease-and-desist-notice-create...

My understanding of the case is that he was charged with evading JSTOR security, not for accessing the MIT network.
Although his charges were ridiculous, they involved physically connecting to a secure network without permission, not just scraping the public part of pages from his own networks.