Hacker News
by jadell 2137 days ago
One day I will write an extensive post (or set of them) about using Puppeteer to bypass sites' anti-bot measures. It's a fascinating (and annoying) cat-and-mouse game. At the end of the day, almost all bot detection measures rely on JavaScript to report back metrics about the browser, yet that JavaScript runs in an environment where the bot completely controls what gets reported.

One of my favorite tricks I've seen employed is detection measures that check whether common detection-bypass tricks have been implemented (for example, by checking the toString output of commonly overridden native functions).

https://theheadless.dev/posts/challenging-flows/#bot-detecti...
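
For illustration, here is a minimal TypeScript sketch of the kind of toString check described above. It is not taken from any particular vendor's script; `looksNative`, `NATIVE_SOURCE`, and `patchedNow` are hypothetical names, and the regex assumes the usual "function name() { [native code] }" shape that engines print for built-ins.

```typescript
// Shape engines typically print for built-ins, e.g.
// "function now() { [native code] }". Assumed, not a formal guarantee.
const NATIVE_SOURCE = /^function\s+[\w$ ]*\(\)\s*\{\s*\[native code\]\s*\}$/;

// Hypothetical detector: ask the engine for the function's source via
// Function.prototype.toString, ignoring any toString the function itself
// carries as an own property.
function looksNative(fn: Function): boolean {
  return NATIVE_SOURCE.test(Function.prototype.toString.call(fn));
}

// A naive bypass: replace a native function and lie in its own toString.
const patchedNow = function now(): number {
  return 0;
};
patchedNow.toString = () => "function now() { [native code] }";

console.log(looksNative(Date.now));   // untouched built-in
console.log(looksNative(patchedNow)); // patched function, own toString is bypassed
console.log(patchedNow.toString());   // the lie only works when called directly
```

The point of going through `Function.prototype.toString.call` is that an own `toString` property on the patched function never gets consulted, so the naive override is exposed.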

7 comments

I recently was working on the same thing (https://github.com/chris124567/puppeteer-bypassing-bot-detec...). The existing solutions (like the headless-cat-n-mouse repo) seemed to be pretty incomplete and easily detected. I got mine to pass all the checks on Antoine Vastel’s site along with Distil Networks’ and PerimeterX’s bot detection (although in practice they may have other detection methods, like checking for rapid URL visits).

Something worth noting about toString is that it can now be undetectably modified (to fake “native code”) with the new ES6 Proxy object. There was a really interesting blog post written about this at https://adtechmadness.wordpress.com/2019/03/23/javascript-ta... (I also incorporated this into my project).
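
A rough TypeScript sketch of the Proxy idea, under my own assumptions rather than as the blog post's or the repo's exact code: `wrapWithProxy` and `fakeNow` are hypothetical names, and the exact string the engine returns for a proxied function is engine-dependent, so the example only checks for the "[native code]" marker.

```typescript
// Hypothetical helper: wrap `target` in a Proxy whose `apply` trap lets
// `intercept` rewrite the return value. No property of the original
// function is overwritten.
function wrapWithProxy<F extends (...args: any[]) => any>(
  target: F,
  intercept: (result: ReturnType<F>) => ReturnType<F>,
): F {
  return new Proxy(target, {
    apply(fn, thisArg, args) {
      // Call the real function, then let the caller tamper with the result.
      return intercept(Reflect.apply(fn, thisArg, args));
    },
  });
}

// Example: make Date.now always report 0 through the wrapper.
const fakeNow = wrapWithProxy(Date.now, () => 0);

// Ask the engine for the wrapper's source text, as a detector might.
const source = Function.prototype.toString.call(fakeNow);

console.log(fakeNow());
console.log(source.includes("[native code]"));
```

In a browser, the wrapper would be assigned back over the original function. The original is never edited, and the engine itself produces the native-looking source for the callable Proxy.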

Using Proxy is key to a lot of bot detection avoidance.

edit: Really like that repo! I use a lot of those techniques as well.

CF seems to have started classifying browsers with no existing CF cookies as likely bots (a score of 10 or less, where 99 is a human and 1 is a confirmed bot) for enterprise users of their Bot Management feature[0]. From my testing, it happens for both Puppeteer and incognito tabs of Chrome, even with perfect IP reputation.

0: https://support.cloudflare.com/hc/en-us/articles/36002751945...

That would explain why I always see CF bot prompts when visiting a site for the first time or the hundredth time in Safari with a few layers of tracking protection and no third-party cookies. I prefer to answer captchas if that’s the price I pay for a bit more privacy, then...
CF and Google Captcha are really making the web unbrowsable with hardened browsers. The web is looking really grim for people who care about privacy these days.
This may be a very noob tool for this game but it has served me well and even though I'm guessing most people know about it, just sharing it for reference:

https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

I wonder if Google Captcha will always be able to defeat Puppeteer? Seems odd for Google to publish a set of abusable APIs, and not be able to detect their use.
There are farms of people who literally sit around all day and solve CAPTCHAs. There's no surefire way to address this problem, and it usually ends up as an orchestration of reputation-score tooling (including making a user fill out a CAPTCHA) to fingerprint a bot.

If you're good at spoofing all of that fingerprinting you'll blow straight past them. It's all client-side, where you have control all the way down to the bits and bytes.

You can just use Google's speech-to-text to solve reCAPTCHA's audio challenge
There are services that will solve captcha for you (including Google's) in "real-time", and with convenient APIs that allow for automation.
Please do! This is a very interesting topic. Looking forward to reading about it.
I know it's passé to say "hey there's an xkcd for that", but this is one of my all-time favorites, and it's directly relevant, so... enjoy! :)

https://xkcd.com/810/