by simonw 2059 days ago
If you're using JavaScript for scraping, you should go straight to the logical conclusion and run your scraper inside a real browser (potentially headless) - using Puppeteer or Selenium or Playwright.

My current favourite stack for this is Selenium + Python - it lets me write most of my scraper in JavaScript that I run inside of the browser, but having Python to control it means I can really easily write the results to a SQLite database while the scraper is running.

I wrote a bit about this here: https://simonwillison.net/2020/Oct/16/weeknotes-evernote-dat...
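
A minimal sketch of the Selenium + Python + SQLite pattern described above, not the author's actual code. The `items` table, the `EXTRACT_JS` snippet and the `scrape_page` helper are all hypothetical names made up for illustration; `driver` is assumed to be any Selenium WebDriver, and only its `get` and `execute_script` methods are used.

```python
import sqlite3

# Hypothetical in-page extraction: collect the text and href of every link.
EXTRACT_JS = """
return Array.from(document.querySelectorAll('a')).map(
    a => ({title: a.textContent.trim(), url: a.href})
);
"""


def scrape_page(driver, conn, url):
    """Load `url`, run EXTRACT_JS in the page, and store the rows in SQLite."""
    driver.get(url)
    # JS arrays and objects returned by the script arrive as Python lists and dicts.
    rows = driver.execute_script(EXTRACT_JS)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (page TEXT, title TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (page, title, url) VALUES (?, ?, ?)",
        [(url, row["title"], row["url"]) for row in rows],
    )
    conn.commit()  # commit per page so results are queryable while the scraper runs
    return len(rows)
```

With a real browser this might be called as `scrape_page(webdriver.Firefox(), sqlite3.connect("scrape.db"), "https://example.com/")`, assuming Selenium and a matching browser driver are installed.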

4 comments

IMO, for most data-gathering needs, running a browser (even a headless one) would be overkill. A browser is better suited to complex interactions, when you need to fully pretend to be a user. Or just for testing purposes, so your environments match.
I've used the Selenium API driving Firefox in the past to scrape customers' data out of proprietary .NET WebForms systems that required a login and didn't offer any option to export the data.

Crawling the list pages and then each edit page in turn allowed for dumping the name and value from each input field to the log as key:value pairs for processing offline.
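
Roughly, the field-dumping step could look like the sketch below. This is an assumption about how it might be done, not the code from that project: `FIELDS_JS` and `dump_fields` are hypothetical names, `driver` is assumed to be a Selenium WebDriver already on an edit page, and `log` is any object with a `write` method (an open file, for instance).

```python
# Hypothetical in-page snippet: name and value of every named form control.
FIELDS_JS = """
return Array.from(document.querySelectorAll('input, select, textarea'))
    .filter(el => el.name)
    .map(el => [el.name, el.value]);
"""


def dump_fields(driver, log):
    """Write each form field on the current page to `log` as a key:value line."""
    pairs = driver.execute_script(FIELDS_JS)
    for name, value in pairs:
        log.write(f"{name}:{value}\n")
    return len(pairs)
```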

Navigating paging was probably the biggest challenge.

I have done the same, to "export" tens of thousands of pages from a client's Sitecore website where they were in a very adversarial relationship with the incumbent Sitecore dev/hosts.

I totally don't recommend doing this. But it worked for this case.

"I hate to advocate drugs, alcohol, violence, or insanity to anyone, but they've always worked for me." -- Hunter S Thompson

Selenium is great, but seems to be easier to detect and block.
Selenium lets you add random delays between your actions, which could help you avoid triggering a firewall block.

Good practice anyway so you don't overload the site and find your logs empty or full of gaps.
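
A small sketch of the random-delay idea. The delay lives in the controlling Python script rather than in Selenium itself; `polite_pause` and its default bounds are hypothetical, picked only for illustration.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=4.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)  # `sleep` is injectable so the pause can be skipped in tests
    return delay
```

Calling `polite_pause()` between `driver.get(...)` calls or clicks spaces the requests out unevenly instead of at a fixed interval.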

Good approach, but advanced Selenium detection goes beyond heuristics. Selenium injects JavaScript into the page to function, and the presence of this is how Selenium is detected.
Interesting, I've worked on both sides of scraping and protecting content but hadn't really considered checking for JavaScript frameworks as a trigger. I'm assuming this is something you could configure in an F5 that also injects its own JavaScript?

Randomising field names, seeding hidden bogus data and messing with element order was more what I would look at once a persistent scraper was using enough IPs to get around rate limits.

I agree with both you and the post you're replying to.

One comment though: once you've passed that "I can't do this without a real browser" line in the sand a few times, you end up with a collection of snippets and skills that moves that line much closer. Sure, I'll load the page and watch in browser tools to see what's in the HTML and what's coming back from XHR calls, but when I've got a directory full of previously used example code to fire up that uses Python/Selenium and deals with the "boilerplate" parts, it's a much easier decision to jump that way than the first time I stared at the BeautifulSoup documentation.

(When the only tool you have is a nailgun, every problem looks like a messiah...)

Most sites these days are single-page apps. Unless cheerio and PhantomJS work well with those (I have not tried), I don't see any other option. A benefit of a browser is that it does multi-processing much better than you do. I only need to add some custom code to block non-JS requests to improve performance a bit.

If you do ad-hoc web scraping, then it's fine to spend time looking for the most efficient way, but if your web scraping framework is part of a data pipeline that scrapes all sorts of websites, then a browser is the route that saves the most development time.

I do that for background scraping (via a userscript that parses the data out of the pages I visit and stores it in a database in the background).

So for example, if I buy some electronics module on AliExpress, my scraper automatically saves all the product description and images to the database right from the browser as I'm making the order.

These details usually contain vital info to use the module, so it's important to me to have an easily searchable reference for all this information. I really don't trust myself to collect all the necessary info manually.

I've used Puppeteer + better-sqlite3 in Node for similar jobs in the past... Great combo, but I tend to use it only if/when node-fetch + cheerio aren't feasible.
What's the easiest way to have Selenium execute JavaScript on the currently loaded page?
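
For comparison, the browserless fetch-and-parse route can be sketched in Python with only the standard library. This is a stand-in for the node-fetch + cheerio combination, not an equivalent of it; `LinkCollector` and `extract_links` are hypothetical names, and the HTML is assumed to have been fetched already.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the hrefs of all links in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The HTML string could come from something like `urllib.request.urlopen(url).read().decode()`. This only works when the data is present in the served HTML, which is the case the browser-based approach is meant to go beyond.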