by Pinbenterjamin 2525 days ago

According to the NDA with my company I can't reveal anything about the architecture beyond the fact that it is hosted locally on a homebuilt distributed system that randomly chooses from a pool of 120 residential IPs.

We do have human emulation routines that helped avoid most detection, and that library is decoupled in such a way that we can edit behavior down to the individual site.

Some sites are just so damn good at detecting us, and I just don't get it.

3 comments

nutjob2 2525 days ago

They can characterise the (browsing) behaviour of all their visitors, and then further characterise those who fall outside their "normal" thresholds. The outsiders that exhibit some sort of correlation (i.e. their characteristics are not independent of each other) are banned. Any quirks or patterns your systems have would be identifiable as "artificial", and even those that are randomised or seek to emulate humans will have features that are identifiable. An NDA is ineffective against machine learning.
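A minimal sketch of that detection idea, seen from the site's side. Everything here is an assumption made for illustration: the two features (seconds between requests, pages per session), the z-score threshold, the distance radius, and the sample data are all hypothetical, and real systems presumably use far richer features and models. It flags visitors who sit outside the "normal" range, then keeps only those outliers that cluster together, which is one simple way to read "not independent of each other".

```python
from math import dist
from statistics import mean, pstdev


def zscore_rows(rows):
    """Standardise each feature column; return one tuple of z-scores per visitor."""
    scaled = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        scaled.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return list(zip(*scaled))


def find_outliers(points, threshold=2.0):
    """Indices of visitors with any feature more than `threshold` std devs from the mean."""
    return [i for i, p in enumerate(points) if any(abs(z) > threshold for z in p)]


def correlated_outliers(points, outliers, radius=0.5):
    """Outliers that have at least one other outlier within `radius` in z-score space."""
    return [
        i for i in outliers
        if any(j != i and dist(points[i], points[j]) <= radius for j in outliers)
    ]


# Hypothetical data: (seconds between requests, pages per session).
humans = [(8 + i, 3 + (7 * i) % 10) for i in range(20)]   # indices 0-19
bots = [(0.5, 200), (0.5, 210), (0.6, 205)]               # indices 20-22
slow_reader = [(120, 5)]                                  # index 23, unusual but alone

points = zscore_rows(humans + bots + slow_reader)
outliers = find_outliers(points)
flagged = correlated_outliers(points, outliers)

print("outliers:", outliers)
print("flagged as a correlated group:", flagged)
```

In this toy data the slow reader is an outlier but has no neighbours, so only the three near-identical automated visitors end up flagged.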

The countermeasure would be to have a bunch of humans use the websites in any way they want, totally undirected, then use the totality of that browsing to facilitate your scraping probabilistically. It would be less efficient, but very difficult to catch.

Pinbenterjamin 2525 days ago

That's the general direction I'd like to take. When we capture the inputs for the scrapers, I'd like to persist everything. Mouse jiggles, delays, idle time. I think it would definitely help advance the software.

UweSchmidt 2525 days ago

In the grand scheme of things all of this is a wasteful process. Maybe you could direct your worklife towards other challenges that are more rewarding for society and equally profitable?

pault 2525 days ago

I think that's unjustified and a little rude. OP is providing an automated service for publicly accessible data that isn't accessible for automation. If the sources are notified and they are operating within the confines of the law, this is no different than writing a search engine crawler.

dang 2525 days ago

That crosses into personal attack. Please don't do that on Hacker News. We've had to ask you this before.

https://news.ycombinator.com/newsguidelines.html

formercoder 2525 days ago

OP is being reasonably compensated for something that is perfectly legal.

arpa 2525 days ago

A pool of 120 residential IPs is way too small: with so few addresses, patterns emerge quickly. Go for thousands or, even better, hundreds of thousands. Outsource the residential proxy system to Luminati or Oxylabs.
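A rough back-of-the-envelope check on why pool size matters, assuming requests are spread evenly across the pool. The daily request volume below is a made-up figure for illustration; nothing in the thread states the actual volume.

```python
def requests_per_ip(total_requests, pool_size):
    """Average requests each IP carries if load is spread evenly over the pool."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return total_requests / pool_size


DAILY_REQUESTS = 100_000  # hypothetical volume, not from the thread

for pool in (120, 5_000, 100_000):
    print(f"{pool:>7} IPs -> {requests_per_ip(DAILY_REQUESTS, pool):8.1f} requests per IP per day")
```

Under that assumed volume, each of 120 addresses would carry several hundred requests a day, while a pool in the hundreds of thousands brings the average down to about one.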

underwater 2525 days ago

This sounds, at best, ethically dubious and at worst illegal. Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Given that you run this division, there is a good chance you are personally liable.

Pinbenterjamin 2525 days ago

We have an enormous legal team that communicates constantly with end points to ensure they are aware of our scraping. And as I said in another comment, we store no results other than what is already available to anyone else using the web.

We've had this division for many, many years, and before my time we paid another company to do this. There are no legal issues.

underwater 2525 days ago

Your legal team is in contact with them, but their security is actively trying to block you? That doesn't make sense.

Computer security laws are very broad. It doesn't matter if it's just a website that the public can access. If you're accessing it in a manner that they don't want AND you're aware of that, then I struggle to see how your lawyers can justify it.

> Computer hacking is broadly defined as intentionally accesses a computer without authorization or exceeds authorized access.

https://definitions.uslegal.com/c/computer-hacking/

Hiding your user agent because you know they don't want automated retrieval of information is "without authorisation".

Havoc 2525 days ago

>Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

I don't think connecting a computer to a private network to suck up subscriber data is comparable to scraping publicly accessible internet content.

3xblah 2525 days ago

These fear mongering comments always ignore the notice provision in the CFAA. Web scraping publicly accessible information is not "illegal" under the CFAA. That law, at most, only makes someone who continues scraping after being asked to stop potentially culpable.

First, the accuser needs to, at least, send a cease and desist letter to the accused asking them to stop accessing the protected computer. Second, the accused needs to ignore that request and keep accessing the protected computer.

Is it possible to build a solid CFAA case when those two things do not happen? I cannot find any examples.

https://iapp.org/news/a/can-a-cease-and-desist-notice-create...

underwater 2525 days ago

My understanding of the case is that he was charged with evading JSTOR security, not for accessing the MIT network.

morpheuskafka 2525 days ago

Although his charges were ridiculous, they involved physically connecting to a secure network without permission, not just scraping the public part of pages from his own networks.