Hacker News
by _QrE 409 days ago
Is there a reason for using Selenium over something like Playwright? I haven't had many positive experiences with Selenium, and I found Playwright easier to use and more flexible.

Also, for stuff like this:

`modified_value = original_value.replace("HeadlessChrome", "Chrome")`

There are quite a few ways to figure out that a browser is a bot, and I don't think replacing a few values like this does much. Not asking you to reveal any tricks, just saying that if you're using something like Playwright, you can e.g. run scripts in the browser to adjust your fingerprint more easily.
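To make that concrete, here is a rough sketch using Playwright's Python sync API. It is not the project's code: `strip_headless`, `open_page`, and the `navigator.webdriver` override are names and choices I made up for illustration, and real bot detection looks at far more than these two values.

```python
# Hypothetical sketch: helper names are illustrative, not from the project.

# JavaScript that Playwright injects before any of the site's own scripts run.
INIT_SCRIPT = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""


def strip_headless(user_agent: str) -> str:
    """Same replacement as the quoted line: drop the 'Headless' marker."""
    return user_agent.replace("HeadlessChrome", "Chrome")


def open_page(url: str) -> str:
    """Load url with an adjusted user agent and return the page title."""
    # Imported lazily so strip_headless works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Read the default user agent once, then reuse it minus the marker.
        probe = browser.new_page()
        default_ua = probe.evaluate("navigator.userAgent")
        probe.close()
        context = browser.new_context(user_agent=strip_headless(default_ua))
        # Applies to every page opened from this context.
        context.add_init_script(INIT_SCRIPT)
        page = context.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

The point is only that the adjustment lives in one place (the context) instead of being patched value by value.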

4 comments

I am quite aware, but I actually built most of the scraping logic a long time ago, before I even knew that Playwright was a thing.

I am looking to refactor a lot of this, and switching over to Playwright is a high priority, along with using something like Camoufox for scraping instead of just Chromium.

Most of my work on this over the past month has been simple nice-to-have additions.

I was in a similar boat with my scrapers. Started with Selenium 5-6 years ago and only discovered Playwright 2 years ago. Spent a month or so swapping the two, which was well worth it. Cleaner API, async support.
Playwright was miles ahead of Selenium, but what I think is really overlooked is chromedp.
Luckily, I have some experience with Playwright, so swapping shouldn't take me too long.

Currently working on a PR to swap over

If you're a fan of Playwright, check out Crawlee [0]. I've used it for a few small projects and it's been faster for me to get what I've needed done.

[0] https://crawlee.dev/

It's by Apify, which is an interesting community.
With custom headers, you can actually trick a lot of sites with bot protection into letting you load their pages (even big sites like YouTube, where I have had success).
How do you work around pop-ups for newsletters and such? Look at the BBC for a good example.
Pack ad blockers into your containers. They can be loaded into Chrome and help immensely in suppressing popovers while crawling.
Thank you, I'll experiment with that. Tips and advice welcome!
Another cool trick is to deny all the content types you don't care about in Playwright. If you only want text, why bother allowing requests for fonts, CSS, SVGs, images, videos, etc.?

Just request the HTML and cut out all the other stuff.

PS: I also think this has the nice side effect of consuming fewer resources from the server (ones you didn't care about or need anyway), so win-win.
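A minimal sketch of that idea with Playwright's Python sync API, assuming you only want the page text. `BLOCKED_TYPES`, `should_block`, and `fetch_text` are names I picked for the example; which resource types are safe to drop depends on the site, since some pages need their scripts or XHR calls to render anything.

```python
# Hypothetical sketch: names and the blocked set are illustrative choices.

# Playwright resource types to refuse; SVGs are reported as "image".
BLOCKED_TYPES = {"image", "media", "font", "stylesheet"}


def should_block(resource_type: str) -> bool:
    """Return True for request types we do not want to download."""
    return resource_type in BLOCKED_TYPES


def fetch_text(url: str) -> str:
    """Load url while aborting unwanted requests, then return the body text."""
    # Imported lazily so should_block works without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        # Abort unwanted requests, let everything else through unchanged.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Intercept every request made by this page.
        page.route("**/*", handle)
        page.goto(url)
        text = page.inner_text("body")
        browser.close()
        return text
```

Keeping the decision in `should_block` makes it easy to loosen the filter for sites that break without CSS.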

That is a great tip, thank you!
Last time I looked, Selenium was able to use Firefox. IDK about Playwright, but Puppeteer was Chrome-only.
Playwright supports Firefox, Chromium, and WebKit.
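For what it's worth, switching engines in the Python API is just a different attribute on the Playwright object. A small sketch, where `ENGINES`, `pick_engine`, and `page_title` are hypothetical names for the example, and each engine has to be installed first (e.g. via `playwright install`):

```python
# Hypothetical sketch: helper names are illustrative.

# Attribute names of the three browser types on the Playwright object.
ENGINES = ("chromium", "firefox", "webkit")


def pick_engine(name: str) -> str:
    """Normalize an engine name and reject anything Playwright doesn't ship."""
    key = name.strip().lower()
    if key not in ENGINES:
        raise ValueError(f"unsupported engine: {name!r}")
    return key


def page_title(url: str, engine: str = "firefox") -> str:
    """Open url in the chosen engine and return the page title."""
    # Imported lazily so pick_engine works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # p.chromium, p.firefox, and p.webkit share the same launch API.
        browser_type = getattr(p, pick_engine(engine))
        browser = browser_type.launch()
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```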