| In my experience, the reason is that there's a high barrier to entry for most devs when it comes to setting up a Chromium build environment and a patch workflow that still lets you quickly and easily pull in and apply upstream changes whenever a new Chromium version is released. In reality, if you know how to use CDP correctly and you control the environment the browser runs in, you need very few browser patches. What I mean by using CDP correctly is this: yes, it is detectable to a certain extent, but that comes down to things like enabling the Runtime domain, which you can easily avoid in your own solution but which libraries like puppeteer / playwright often do out of the box (this is where the "stealth" versions of these libraries come in: they either mitigate by disabling features or use hacky approaches to instrument the JS that runs on the pages). Then, when you move into a much more stripped-down environment (say, from your home machine to Docker), you run into A LOT of issues that you are definitely better off fixing with browser patches. However, figuring out what those issues are and how to fix them is a huge feat in itself, and it often requires the ability to reverse engineer Cloudflare, Akamai and other anti-bot vendors just to know which leaks you still have to patch. It doesn't help that there is no end to misinformed articles on things like "browser fingerprinting" that you run into the first time you try to solve these issues: articles based on nothing but superstition, articles that basically say "proxies are never good enough" or "captchas are getting out of hand", that get things wrong and will just eat away at your sanity while you debug. This is already long enough of a rant, but maybe it offers you some insight. If you have any specific questions, feel free to ask. |
It seems to me that playing a cat-and-mouse game with these anti-bot systems is unnecessary. Design a system that mimics a legitimate user to such a degree that it's either indistinguishable from an actual user or can only be blocked at the cost of an unacceptable level of false positives for the detection system. This is not an even playing field: the bot has all the advantages.
For example:
- Enumerate all the possible ways in which the webpage can glean insight into user input/activity.
- Hook all these functions by injecting code into the browser, at a level above and completely inaccessible to anything the web page can do to detect/interfere.
- Create functions that mimic user activities (mouse pathing, aimless mouse wandering, random scrolls, clicks, text selections, etc.)
- Feed the outputs of these functions into the functions that you hooked.
- Rip out whatever information you want from the Chrome data structures in memory. Can probably reuse CDP code here.
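
To make the "mouse pathing" step a bit more concrete, here is a rough sketch of what one of those input functions could look like. It only generates a list of timestamped points along a curved path; it does not hook anything in the browser. The function name `mouse_path`, its parameters, and the choice of a cubic Bezier curve with smoothstep easing and Gaussian jitter are all my own assumptions for illustration, not something taken from any particular tool or vendor.

```python
import math
import random
from typing import List, Optional, Tuple

Sample = Tuple[float, float, float]  # (x, y, elapsed milliseconds)


def mouse_path(
    start: Tuple[float, float],
    end: Tuple[float, float],
    steps: int = 50,
    duration_ms: float = 600.0,
    jitter: float = 1.5,
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Hypothetical helper: sample a curved, slightly noisy path from start to end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end

    # Two random control points bend the path so it is not a straight line.
    spread = max(math.hypot(x3 - x0, y3 - y0) * 0.3, 1.0)
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)

    path: List[Sample] = []
    for i in range(steps):
        u = i / (steps - 1)          # linear time fraction in [0, 1]
        s = u * u * (3 - 2 * u)      # smoothstep: slow start, fast middle, slow end
        a, b = (1 - s) ** 3, 3 * (1 - s) ** 2 * s
        c, d = 3 * (1 - s) * s ** 2, s ** 3
        x = a * x0 + b * x1 + c * x2 + d * x3
        y = a * y0 + b * y1 + c * y2 + d * y3
        if 0 < i < steps - 1:
            # Small per-sample noise; endpoints stay exact.
            x += rng.gauss(0, jitter)
            y += rng.gauss(0, jitter)
        path.append((x, y, u * duration_ms))
    return path
```

Something like `mouse_path((10, 10), (640, 380))` gives you a list of samples that you would then feed into whatever input hooks you set up. Whether a curve this simple holds up against a real detection system is exactly the open question; it's a starting point, nothing more.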
After all this, the only challenge that would remain is to perfect the input functions that are supposed to mimic a legitimate user. Depending on how sophisticated these anti-bot systems can/will get, you may also need to cultivate user browsing habit profiles to enter advertising/spying databases as real humans.