by pkiv 848 days ago
In the end, the best way to avoid being blocked is to be a good actor. None of these hacks will stop a site that's determined to prevent access (e.g., LinkedIn).

That's actually one of the reasons why I started https://browserbase.com/. Maintaining headless browser infrastructure can be such a pain. I've spent a lot of time managing headless chrome fleets at scale, so happy to answer any questions.

Are there any stories you're willing to share, any tough nuts you've had to crack to improve some aspect of operations, whether it be reliability, performance, bot detection evasion, or something else completely?

I've only dealt with scraping on a small scale, and I quickly realized that running "browsers as a service" is a pain in the ass: they're not exactly lightweight, and they like to get "stuck", balloon in memory, or some such.

I imagine your business will be quite successful if reliability is good and the price is right!

I gave a lightning talk on headless chrome here that is worth checking out!

https://www.youtube.com/watch?v=vs-qzlW9M50&t=726s

If I understand correctly, a lot of the issues you can run into with regard to blocking come from the fact that you're using a headless browser. Past a certain point, wouldn't it be less work to use a regular browser and drive it with Selenium or a similar solution? Or does that not address the kind of problems you're facing?

I used to semi-automate access to some sites by using Selenium with a non-headless browser. These were sites where there were just one or two pages on which I wanted some automation to fill out a form or scrape some data, and they frequently made changes to the home page that made it hard to automate navigating from the home page to the pages I wanted to automate.

The idea was to have a script use Selenium to launch non-headless Chrome and then wait:

  from selenium.webdriver import Chrome

  driver = Chrome()  # opens a visible (non-headless) Chrome window
  driver.get("https://example.org")
  input("Press enter when ready")  # pause while I log in and navigate by hand

I could then manually deal with logging in, answer any CAPTCHA that came up, and navigate to the page where I wanted to run my automation. Then I could press "enter" in my terminal and my script would continue.

That used to work fine, but then on sites using Cloudflare's CAPTCHA it stopped working. Solving the CAPTCHA would just result in another CAPTCHA.

I tried an alternative Selenium Chrome driver that was supposed to be more stealthy, and tried setting various flags that were supposed to make it so JavaScript could not tell that Selenium was there, and those worked for a while, but then they stopped working.

The results were similar using Selenium with Firefox.

I also tried Puppeteer, with Chromium and Firefox, and they too could not get past the CAPTCHA loops.

I then tried Playwright. With Chromium and WebKit I got the CAPTCHA loops. With Firefox it actually worked. I didn't even see the CAPTCHA. The non-interactive check for not being a bot passed.

Still, the whole approach seems fragile. I don't know if Firefox/Playwright working was due to some fundamental difference between Firefox and the others or just Cloudflare having not yet gotten around to dealing with it.
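
For anyone who wants to try the same thing, here is a minimal sketch of the pause-and-continue script using Playwright's Python sync API with a visible Firefox window. The function names `run_with_manual_step` and `page_title` are made up for illustration, and nothing here is claimed to get past any particular bot check; it only reproduces the workflow described above.

```python
def run_with_manual_step(url, automate, browser_name="firefox"):
    """Open a visible browser, wait for the human, then run automate(page)."""
    # Imported lazily so this file still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # p.firefox, p.chromium and p.webkit are the three bundled browser types.
        browser_type = getattr(p, browser_name)
        browser = browser_type.launch(headless=False)  # visible window
        try:
            page = browser.new_page()
            page.goto(url)
            # Log in, answer any CAPTCHA, and navigate by hand, then press enter.
            input("Press enter when ready")
            return automate(page)
        finally:
            browser.close()


def page_title(page):
    # Placeholder automation step (hypothetical): read the current page title.
    return page.title()


# Example call (needs Playwright and its Firefox build installed):
# print(run_with_manual_step("https://example.org", page_title))
```

Passing `browser_name="chromium"` or `"webkit"` runs the same script against the other engines, which is an easy way to compare how a site treats each of them.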

The newest version of headless chrome actually runs the same code as a "regular browser": https://developer.chrome.com/docs/chromium/new-headless
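
A rough sketch of how you might opt in to that mode from Selenium, assuming a Chrome version that accepts the `--headless=new` flag described on the linked page. The helper names and the window size are my own choices for illustration, and whether this changes how any given site treats you is not something this snippet can promise.

```python
NEW_HEADLESS_FLAG = "--headless=new"  # selects the new headless mode


def headless_chrome_args(window_size=(1280, 800)):
    """Return Chrome command-line arguments for a new-headless session."""
    width, height = window_size
    # --window-size gives the headless browser a fixed viewport (values assumed).
    return [NEW_HEADLESS_FLAG, f"--window-size={width},{height}"]


def launch_headless_chrome(url):
    """Start Chrome in the new headless mode and load url."""
    # Imported lazily so this file still loads where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_args():
        options.add_argument(arg)
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    return driver  # caller is responsible for driver.quit()


# Example call (needs Selenium and a local Chrome install):
# driver = launch_headless_chrome("https://example.org")
```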

I created a dedicated Chrome profile (--user-data-dir), signed in to a few sites, and then drive it from scripts with a visible window.

It does all my crawling. It goes very slowly, but it has never triggered the bot detectors.
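
In case it helps, a minimal sketch of that setup with Selenium, assuming the profile directory already exists and is signed in. The helper names, the `handle` callback, and the delay values are placeholders I picked for illustration; "very slow" could mean something quite different in practice.

```python
import random
import time


def profile_chrome_args(profile_dir):
    """Chrome arguments that point the browser at a dedicated profile."""
    return [f"--user-data-dir={profile_dir}"]


def polite_delays(n, base=10.0, jitter=5.0):
    """Return n pauses in seconds: base plus up to jitter of random extra."""
    # The default numbers are assumptions, not measured thresholds.
    return [base + random.uniform(0, jitter) for _ in range(n)]


def crawl(urls, profile_dir, handle):
    """Visit each URL slowly in a visible Chrome window, passing HTML to handle."""
    # Imported lazily so this file still loads where Selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in profile_chrome_args(profile_dir):
        options.add_argument(arg)
    # No headless flag is added, so the window stays visible.
    driver = webdriver.Chrome(options=options)
    try:
        for url, pause in zip(urls, polite_delays(len(urls))):
            driver.get(url)
            handle(url, driver.page_source)
            time.sleep(pause)  # go slowly between pages
    finally:
        driver.quit()


# Example call (the profile path is hypothetical):
# crawl(["https://example.org"], "/path/to/crawl-profile", lambda u, html: print(u, len(html)))
```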