| HN Mirror


by tzs 1408 days ago

What happens in Playwright with sites using CAPTCHAs?

I had been occasionally scraping a site via curl, but then they started using Cloudflare's anti-bot stuff.

I switched to Selenium and that worked for a while--my Selenium script would navigate to the site, pause to let me manually deal with Cloudflare, and then automatically grab the data I wanted. But then that stopped working.

I found a Stack Overflow answer that gave some settings in Selenium to make it not tell the site's JavaScript that the browser was being automated and that briefly made things happy, but not too long afterwards that broke. There's a Selenium Chrome driver available that is meant for scraping which apparently tries to hide all evidence that the browser is being automated, but it didn't fool Cloudflare.
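
For anyone wondering what that kind of Selenium setting looks like, here is a minimal sketch. The Stack Overflow answer isn't linked, so the specific flags are an assumption: they are the ones most commonly suggested for this purpose, not necessarily the ones used above. The names `STEALTH_ARGUMENTS`, `STEALTH_EXPERIMENTAL`, and `build_chrome_options` are made up for illustration.

```python
# Assumption: these are the commonly suggested Chrome settings for hiding
# automation hints, not necessarily the ones from the answer mentioned above.
STEALTH_ARGUMENTS = [
    # Stops Blink from exposing navigator.webdriver = true to page scripts.
    "--disable-blink-features=AutomationControlled",
]
STEALTH_EXPERIMENTAL = {
    # Drops the "enable-automation" switch that ChromeDriver adds by default.
    "excludeSwitches": ["enable-automation"],
    # Skips loading the legacy automation extension.
    "useAutomationExtension": False,
}


def build_chrome_options():
    """Return Selenium ChromeOptions with the settings above applied."""
    # Imported lazily so the constants can be inspected without Selenium.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in STEALTH_ARGUMENTS:
        options.add_argument(arg)
    for name, value in STEALTH_EXPERIMENTAL.items():
        options.add_experimental_option(name, value)
    return options
```

Usage would be something like `webdriver.Chrome(options=build_chrome_options())`. As described above, settings like these only cover the obvious signals, so they may stop working against a given site at any time.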

What I want is a browser-based automation tool that to the site is indistinguishable from a human browsing, except possibly by the timing of user actions. E.g., if the site can deduce it is being automated because the client responds faster than human reaction time, or with too little variation in response time, that's fine.
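
The timing part is the one piece a script can at least approximate. A rough sketch, assuming a log-normal spread of pauses is a reasonable stand-in for human reaction times; the function names and the default numbers are hypothetical, not measured values:

```python
import math
import random
import time


def human_delay(median=0.6, sigma=0.35, floor=0.15, rng=random):
    """Return a randomized pause in seconds.

    A log-normal draw is right-skewed: mostly short pauses with an
    occasional long one. `floor` keeps the result from dropping below
    a plausible reaction time. All defaults are illustrative guesses.
    """
    delay = rng.lognormvariate(math.log(median), sigma)
    return max(floor, delay)


def pause_like_human(**kwargs):
    """Sleep for one randomized delay; call between automated actions."""
    time.sleep(human_delay(**kwargs))
```

Calling `pause_like_human()` between clicks and keystrokes adds both a minimum delay and some variation. It does nothing about the other signals a site can check.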