| HN Mirror

by curun1r 2530 days ago

> Where do I start? Study the chromium source?

I'm curious why you'd jump straight to browser detection as the most likely culprit. When I was doing scraping, the far more common case was bot detection by origin and access patterns. It's just very difficult to make an automated scraper look like a residential or business user.

Where do you run your scraping operation? Is it in AWS or some other hosting provider? That will get you blocked quickly by a lot of sites. Do you rate limit, including adding random jitter to mimic the way a human might use a browser?
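For anyone unsure what rate limiting with random jitter looks like in practice, here is a minimal sketch. The function name `jittered_delays` and the parameters `base_seconds` and `jitter_fraction` are made up for illustration, and the uniform distribution is only an assumption; nothing in this thread says which distribution works best.

```python
import random


def jittered_delays(base_seconds, jitter_fraction, count, rng=None):
    """Yield `count` delays, each within +/- jitter_fraction of base_seconds.

    Hypothetical helper: a uniform spread around the base delay is an
    assumption made for this sketch, not a recommendation from the thread.
    """
    rng = rng or random.Random()
    for _ in range(count):
        # Scale the base delay by a random factor in [1 - j, 1 + j].
        factor = 1.0 + rng.uniform(-jitter_fraction, jitter_fraction)
        yield base_seconds * factor


# A seeded generator keeps the example reproducible.
delays = list(
    jittered_delays(base_seconds=2.0, jitter_fraction=0.5, count=5,
                    rng=random.Random(0))
)
```

Each value would be passed to `time.sleep()` between requests, so the average rate stays near one request per `base_seconds` while the gaps themselves vary.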

There are scraping services available that essentially use a network of browsers on residential connections with their extension installed to get around scraping detection. It's much slower, but it's much more reliable. We also had some success by signing up with a bunch of the VPN providers (PIA, NordVPN, ExpressVPN, etc.) and cycling through their servers frequently. Anything to avoid creating patterns that look automated or being tied to an IP that can be blacklisted. I'd start there before I'd worry about hacky JavaScript detection, like in this story, being what's tripping you up.

Pinbenterjamin 2530 days ago

According to the NDA with my company I can't reveal anything about the architecture beyond the fact that it is hosted locally on a homebuilt distributed system that randomly chooses from a pool of 120 residential IPs.

We do have human emulation routines that helped avoid most detection, and that library is decoupled in such a way that we can edit behavior down to the individual site.

Some sites are just so damn good at detecting us and I just don't get it.

nutjob2 2530 days ago

They can characterise the (browsing) behaviour of all their visitors, and then further characterise those who fall outside their "normal" thresholds. The outsiders that exhibit some sort of correlation (i.e. their characteristics are not independent of each other) are banned. Any quirks or patterns your systems have would be identifiable as "artificial", and even those that are randomised or seek to emulate humans will have features that are identifiable. An NDA is ineffective against machine learning.
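A toy sketch of that two-step idea, seen from the site's side: flag visitors far outside the normal range, then keep only the flagged ones that resemble each other. The comment names no specific method, so the z-score threshold, the single "requests per minute" feature, and the helper names `flag_outliers` and `group_similar` are all assumptions made for illustration. A real system would use many features and a proper model.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Return indices more than `threshold` population std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # everyone behaves identically, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def group_similar(indices, values, tolerance):
    """Group flagged visitors whose values lie within `tolerance` of a neighbour.

    Only groups with more than one member are returned: a lone outlier may
    just be an unusual human, while a cluster of near-identical outliers
    suggests a shared (automated) origin.
    """
    groups = []
    for i in sorted(indices, key=lambda idx: values[idx]):
        if groups and values[i] - values[groups[-1][-1]] <= tolerance:
            groups[-1].append(i)
        else:
            groups.append([i])
    return [g for g in groups if len(g) > 1]


# Made-up data: 20 ordinary visitors and 3 near-identical heavy hitters.
requests_per_minute = [2, 3, 4, 5, 6] * 4 + [59, 60, 61]
flagged = flag_outliers(requests_per_minute)
suspect_groups = group_similar(flagged, requests_per_minute, tolerance=2)
```

With this made-up data the three heavy hitters are flagged and end up in one group, which is the "outsiders whose characteristics are not independent" case described above.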

The countermeasure would be to have a bunch of humans use the websites in any way they want, totally undirected, then use the totality of that browsing to facilitate your scraping probabilistically. It would be less efficient, but very difficult to catch.

Pinbenterjamin 2530 days ago

That's the general direction I'd like to take. When we capture the inputs for the scrapers, I'd like to persist everything. Mouse jiggles, delays, idle time. I think it would definitely help advance the software.

UweSchmidt 2530 days ago

In the grand scheme of things all of this is a wasteful process. Maybe you could direct your worklife towards other challenges that are more rewarding for society and equally profitable?

pault 2530 days ago

I think that's unjustified and a little rude. OP is providing an automated service for publicly accessible data that isn't accessible for automation. If the sources are notified and they are operating within the confines of the law, this is no different than writing a search engine crawler.

dang 2530 days ago

That crosses into personal attack. Please don't do that on Hacker News. We've had to ask you this before.

https://news.ycombinator.com/newsguidelines.html

formercoder 2530 days ago

OP is being reasonably compensated for something that is perfectly legal.

arpa 2530 days ago

A pool of 120 residential IPs is way too small - patterns are more emergent. Go for thousands, even better, hundreds of thousands. Outsource the residential proxy system to Luminati or Oxylabs.

underwater 2530 days ago

This sounds, at best, ethically dubious and at worst illegal. Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Given that you run this division there is a good chance you are personally liable.

Pinbenterjamin 2530 days ago

We have an enormous legal team that communicates constantly with end points to ensure they are aware of our scraping. And as I said in another comment, we store no results other than what is already available to anyone else using the web.

We've had this division for many many years, and before my time we paid another company to do this. There are no legal issues.

underwater 2530 days ago

Your legal team is in contact with them, but their security is actively trying to block you? That doesn't make sense.

Computer security laws are very broad. It doesn't matter if it's just a website that the public can access. If you're accessing it in a manner that they don't want AND you're aware of that, then I struggle to see how your lawyers can justify it.

> Computer hacking is broadly defined as intentionally accesses a computer without authorization or exceeds authorized access.

https://definitions.uslegal.com/c/computer-hacking/

Hiding your user agent because you know they don't want automated retrieval of information is "without authorisation".

Havoc 2530 days ago

>Aaron Swartz was arrested and charged under hacking laws for doing exactly what you're describing.

Don't think connecting a computer to a private network to suck up subscriber data is comparable to scraping publicly accessible internet content.

3xblah 2530 days ago

These fear mongering comments always ignore the notice provision in the CFAA. Web scraping publicly accessible information is not "illegal" under the CFAA. That law, at most, only makes someone who continues scraping after being asked to stop potentially culpable.

First, the accuser needs to, at least, send a cease and desist letter to the accused asking them to stop accessing the protected computer. Second, the accused needs to ignore that request and keep accessing the protected computer.

Is it possible to build a solid CFAA case when those two things do not happen? I cannot find any examples.

https://iapp.org/news/a/can-a-cease-and-desist-notice-create...

underwater 2530 days ago

My understanding of the case is that he was charged with evading JSTOR security, not for accessing the MIT network.

morpheuskafka 2530 days ago

Although his charges were ridiculous, they involved physically connecting to a secure network without permission, not just scraping the public part of pages from his own networks.

brlewis 2530 days ago

> rate limit, including adding random jitter to mimic the way a human might use a browser

Even if you aren't trying to disguise anything, adding some randomness helps avoid one particular bad pattern with operations on a network. I recall the pattern being called "network synchronization" but I can't get good search results for that.
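A small simulation can illustrate the pattern described here. It is only a sketch under made-up assumptions: 100 clients that all start at the same moment (say, after a shared restart), a fixed 10 second period, and a hypothetical `peak_load` helper that counts how many requests land in the same one-second bucket.

```python
import random
from collections import Counter


def peak_load(num_clients, period, rounds, jitter, seed=0):
    """Return the largest number of requests landing in one 1-second bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(num_clients):
        t = 0.0  # assumption: every client starts at the same instant
        for _ in range(rounds):
            # Fixed period plus an optional random extra wait.
            t += period + rng.uniform(0.0, jitter)
            buckets[int(t)] += 1
    return max(buckets.values())


# Without jitter every client fires in lockstep; with jitter they drift apart.
synchronized = peak_load(num_clients=100, period=10.0, rounds=20, jitter=0.0)
jittered = peak_load(num_clients=100, period=10.0, rounds=20, jitter=5.0)
```

Without jitter all 100 clients hit the same second every round, so the peak equals the number of clients. With a few seconds of random extra wait the requests spread out and the peak drops well below that.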