Hacker News
by coppsilgold 543 days ago
Why not create a library that you inject into the Chrome process though?

It seems to me that playing a cat and mouse game with these anti-bot systems is unnecessary. Design a system which mimics a legitimate user to such a degree that it's either indistinguishable from an actual user or would produce an unacceptable level of false positives for the detection system. This is not an even playing field; the bot has all the advantages.

For example:

- Enumerate all the possible ways in which the webpage can glean insight into user input/activity.

- Hook all these functions by injecting code into the browser, at a level above and completely inaccessible to anything the web page can do to detect or interfere.

- Create functions that mimic user activities (mouse pathing, aimless mouse wandering, random scrolls, clicks, text selections, etc.)

- Feed the outputs of these functions into the functions that you hooked.

- Rip out whatever information you want from the Chrome data structures in memory. Can probably reuse CDP code here.
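
As a rough illustration of the "mimic user activities" step in the list above, here is a minimal Python sketch of a mouse path generator. It is only an assumption about what such a function might look like: the name `mouse_path`, its parameters, and the choice of a cubic Bezier curve with smoothstep easing and Gaussian jitter are all hypothetical, and nothing here is claimed to defeat any particular detection system.

```python
import math
import random


def mouse_path(start, end, steps=50, jitter=1.5, rng=None):
    """Return (x, y, dt) samples along a curved path from start to end.

    Hypothetical helper: a cubic Bezier curve with eased timing and
    small Gaussian jitter on the intermediate points.
    """
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    dist = math.hypot(x3 - x0, y3 - y0)

    # Two random control points bend the path away from a straight line.
    spread = max(dist * 0.25, 1.0)
    x1 = x0 + (x3 - x0) * 0.3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) * 0.3 + rng.uniform(-spread, spread)
    x2 = x0 + (x3 - x0) * 0.7 + rng.uniform(-spread, spread)
    y2 = y0 + (y3 - y0) * 0.7 + rng.uniform(-spread, spread)

    points = []
    for i in range(steps + 1):
        u = i / steps
        t = u * u * (3 - 2 * u)  # smoothstep: slow start, slow finish
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        x = a * x0 + b * x1 + c * x2 + d * x3
        y = a * y0 + b * y1 + c * y2 + d * y3
        if 0 < i < steps:
            # Jitter only the intermediate points so the endpoints stay exact.
            x += rng.gauss(0, jitter)
            y += rng.gauss(0, jitter)
        dt = rng.uniform(0.006, 0.014)  # seconds until the next sample
        points.append((x, y, dt))
    return points
```

Each `(x, y, dt)` sample would then be fed into whatever input hook you installed, waiting `dt` seconds between samples.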

After all this, the only challenge that would remain is to perfect the input functions that are supposed to mimic a legitimate user. Depending on how sophisticated these anti-bot systems can/will get, you may also need to cultivate user browsing habit profiles to enter advertising/spying databases as real humans.

1 comments

> It seems to me that playing a cat and mouse game with these anti-bot systems is unnecessary. Design a system which mimics a legitimate user to such a degree that it's either indistinguishable from an actual user or would produce an unacceptable level of false positives for the detection system.

This is the most common misconception: the challenges you face with browser automation at scale are not *automation* challenges.

You can use real human input, with actual humans doing the input, and you will still get blocked.

Automation at scale means running dozens to hundreds of browser instances concurrently on the same hardware. Once you have mitigated IP-related issues, you start running into the actual challenges, which are completely different from the automation part.

You have to research all the little quirks browsers have through the various APIs that they offer and then compare that data to real world data before you can start to actually fix the problems.

There are browsers, such as Brave, which randomize such fingerprints. The web page does not have any insight into your hardware that you cannot mitigate by having the browser fake the responses.

You can also use Linux features such as namespaces & TUN devices[1] to properly utilize proxies. Something I noticed is that Chrome under --proxy-server=socks5:// is incapable of using HTTP/3 (UDP), for example, perhaps a deliberate oversight.

[1] <https://github.com/xjasonlyu/tun2socks>

When scaling browser automation, generating random fingerprints for the most common high-entropy data points is counterproductive. It just ends up lowering your trust score and shifting attention to other browser properties with less entropy, making those the primary identifiers.

For example, degrading canvas, WebGL, or WebGPU fingerprints (e.g., by introducing noise like Brave does) might lead anti-bot systems to either ignore them or punish you with captchas. Once ignored, other signals, such as screen resolution (just an example), become more important. This helps people with privacy by letting them blend in with other users, and a single user visiting a website normally will probably not notice much. But an influx of many users with degraded fingerprints and similar resolutions becomes easy to detect and might get a captcha or get blocked (e.g. 30-50+ browser sessions concurrently generating cookies for a specific captcha).
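
To make the clustering point concrete, here is a toy Python sketch of how a detector could notice many concurrent sessions sharing the same low-entropy values once the high-entropy signals are ignored. The function name, the field names (`screen`, `timezone`, `language`), and the threshold are hypothetical; this is not how any specific vendor works.

```python
from collections import Counter


def crowded_fingerprints(sessions, keys=("screen", "timezone", "language"), threshold=3):
    """Return low-entropy tuples shared by at least `threshold` sessions.

    `sessions` is a list of dicts; `keys` names the (assumed) low-entropy
    fields that remain once noisy high-entropy signals are ignored.
    """
    counts = Counter(tuple(s.get(k) for k in keys) for s in sessions)
    return {fp: n for fp, n in counts.items() if n >= threshold}
```

With 30-50 sessions behind the same few resolutions, almost every session would land in one of these crowded buckets.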

You can spoof multiple resolutions and then add some other properties, but that requires consistency across all of them, which can come down to weird browser-specific quirks as well as whatever the anti-bot vendor's data set contains (regardless of how accurate it is). There are only so many plausible values for each low-entropy data point that anti-bot systems will score highly, so you are forced to spoof as many data points as possible to maintain a high trust score across many concurrent sessions. Eventually you either scale back or hit a limit for your operation, or you deal with captchas by solving them and lose to the competition that doesn't have to.

Fingerprinting at scale isn’t just about spoofing individual data points - it’s about aligning all points in a realistic way and knowing which ones relate to each other and how, which requires extensive data and research.
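
A minimal Python sketch of what "aligning all points" can mean in practice: checking that spoofed values do not contradict each other. The rule table and field names below are illustrative assumptions, not a real vendor's rule set; the two checks shown (user agent versus `navigator.platform`, and available width versus screen width) are just simple examples of cross-property relationships.

```python
# Hypothetical rule table: user agent token -> expected navigator.platform prefix.
PLATFORM_BY_UA_TOKEN = [
    ("Windows NT", "Win32"),
    ("Macintosh", "MacIntel"),
    ("Linux", "Linux"),
]


def inconsistencies(fp):
    """Return a list of human-readable contradictions in a spoofed profile."""
    problems = []
    ua = fp.get("user_agent", "")
    platform = fp.get("platform", "")

    # The platform string should agree with the OS named in the user agent.
    for token, prefix in PLATFORM_BY_UA_TOKEN:
        if token in ua:
            if not platform.startswith(prefix):
                problems.append(f"platform {platform!r} does not match UA token {token!r}")
            break

    # The available width can never exceed the full screen width.
    width = fp.get("screen_width")
    avail = fp.get("avail_width")
    if width is not None and avail is not None and avail > width:
        problems.append(f"avail_width {avail} exceeds screen_width {width}")

    return problems
```

A real profile would need far more rules than this, and those rules only come out of the research and real-world data described above.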

On proxies: flagged IPs with residential ASNs often work fine if the overall trust score is high, but degraded fingerprints like Brave’s can undermine that advantage, at which point IP reputation becomes a lot more important. It's always nice to eliminate IP issues if you are able to do so, though.

Even a single script that performs actions too quickly on a website can trigger anti-bot measures, even if the bot isn't detected directly.

I'm not denying that; I'm saying it's not a difficult challenge to solve when you compare it to the others I mentioned.