by photochemsyn 854 days ago
I've found this approach works really well using JavaScript and puppeteer for the first stage, and then Python for the second stage (the re module for regular expressions is nice here IMO).
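For the second stage, here is a minimal sketch of what I mean, assuming the first stage has already handed over the rendered page as an HTML string. The sample HTML, the `extract_links` name and the pattern are all made up for illustration. Regexes work for simple, predictable markup, but they are not a general HTML parser.

```python
import re

# Hypothetical output of the first (browser) stage: rendered HTML as a string.
SAMPLE_HTML = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second item</a></li>
</ul>
"""

# Capture the href value and the link text of simple <a> tags.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)


def extract_links(html):
    """Return (href, text) pairs for every simple anchor tag in html."""
    return [(href, text.strip()) for href, text in LINK_RE.findall(html)]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_HTML):
        print(href, text)
```

Compiling the pattern once at module level keeps the per-page cost low when the same extraction runs over many saved pages.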

JS/puppeteer seems a bit easier for things like rotating user agents. From the article:

> "Websites often block scrapers via blocked IP ranges or blocking characteristic bot activity through heuristics. Solutions: Slow down requests, properly mimic browsers, rotate user agents and proxies."
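Two of the solutions in that quote (slowing down and rotating user agents) can be sketched with nothing but the standard library. This is only an illustration under assumptions: the user-agent strings are placeholders, the function names and the 1 to 3 second delay range are my own choices, and proxy rotation is left out.

```python
import itertools
import random
import time
import urllib.request

# Placeholder user-agent strings; real ones should match actual browsers.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/3.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def build_request(url):
    """Build a request that carries the next user agent in the rotation."""
    return urllib.request.Request(url, headers={"User-Agent": next(_ua_cycle)})


def polite_delay(min_s=1.0, max_s=3.0):
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_s, max_s)


def fetch_all(urls):
    """Fetch each URL in turn, rotating user agents and pausing in between."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            pages.append(resp.read())
        time.sleep(polite_delay())
    return pages
```

Randomizing the pause instead of sleeping a fixed interval is a guess at what looks less bot-like. The article's heuristics point does not say which timing patterns sites actually check.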

1 comment

If you're using JS in the first step only because you need puppeteer, check out Playwright. It's what the original authors of puppeteer are working on now, and it has been more actively developed over the past few years. Usage and features are very similar, and it also has an official Python package.
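For reference, a minimal sketch of that first stage in Python using Playwright's sync API. It assumes `pip install playwright` and `playwright install chromium` have been run. The `fetch_rendered_html` name and the optional `user_agent` argument are my own choices for illustration.

```python
def fetch_rendered_html(url, user_agent=None):
    """Load url in headless Chromium and return the rendered HTML."""
    # Imported here so the module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # A custom user agent is optional; None keeps the browser default.
        context = browser.new_context(user_agent=user_agent)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```

The returned string can go straight into whatever regex or parsing step comes next, so both stages stay in one language.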