| HN Mirror


	by showerst 247 days ago
	A point orthogonal to this: consider whether you need browser automation at all. If a website isn't using Cloudflare or a JS-only design, it's generally better to skip Playwright. All the major AIs understand BeautifulSoup pretty well, and they're likely to write you a faster, less brittle scraper.
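A minimal sketch of the no-browser approach for a static page, under some loud assumptions: it uses Python's standard-library `html.parser` as a dependency-free stand-in for BeautifulSoup (with BeautifulSoup the extraction step would be shorter), and the names `LinkExtractor`, `extract_links`, `fetch_html`, and the User-Agent string are all hypothetical, not anything from the thread.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> in static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return its links; no browser involved."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url, timeout=10.0):
    """Plain HTTP GET. The User-Agent value is a made-up placeholder."""
    req = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Usage would look like `extract_links(fetch_html(url))` for a page that serves its content in the initial HTML. This does nothing for pages that render only after JavaScript runs.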

3 comments

Etheryte 247 days ago

The vast majority of the modern internet falls into one of those two buckets though, no?

showerst 247 days ago

I mostly scrape government data, so the sites are a little 'behind' on that trend, but no. Even JS-heavy sites are almost always pulling from a JSON or GraphQL source under the hood.

At scale, dropping the heavier dependencies and network traffic of a browser is meaningful.
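A rough sketch of what pulling from the underlying JSON source can look like once the endpoint has been spotted in the browser's network tab. Everything specific here is an assumption for illustration: the `page`/`per_page` query parameters, the `results` key, the User-Agent string, and the function names `fetch_json` and `iter_records` are hypothetical, and a real endpoint will have its own pagination scheme.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url, params=None, timeout=10.0):
    """GET a JSON document with the standard library only."""
    if params:
        url = f"{url}?{urlencode(params)}"
    req = Request(
        url,
        headers={
            "Accept": "application/json",
            "User-Agent": "example-scraper/0.1",  # placeholder value
        },
    )
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def iter_records(base_url, fetch=fetch_json, page_size=100):
    """Yield records page by page until the endpoint returns an empty page.

    Assumes a hypothetical API that takes `page` and `per_page` and
    wraps its rows in a top-level `results` list.
    """
    page = 1
    while True:
        payload = fetch(base_url, {"page": page, "per_page": page_size})
        records = payload.get("results", [])
        if not records:
            return
        yield from records
        page += 1
```

Passing `fetch` as a parameter keeps the paging logic testable without network access, and makes it easy to swap in a client with retries or rate limiting later.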

suchintan 247 days ago

Yeah, reverse engineering APIs is another fantastic approach. They aren't enough if you are dealing with wizards (e.g., Typeform), but they can work really well.

suchintan 247 days ago

IF you can use crawlers, definitely do.

They aren't enough for anything that's login-protected or requires interacting with wizards (e.g., JS, downloading files, etc.).

pavel_lishin 247 days ago

If.
