Hacker News
by simonw 1412 days ago
It's increasingly difficult these days to write scrapers that don't at some point need to execute JavaScript on a page - so you need to have a good browser automation tool on hand.

I'm really impressed by Playwright. It feels like it has learned all of the lessons from systems like Selenium that came before it - it's very well designed and easy to apply to problems.

I wrote my own CLI scraping tool on top of Playwright a few months ago, which has been a fun way to explore Playwright's capabilities: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

4 comments

My $0.02: in most cases I have seen, you don't need to emulate a browser to scrape, even if it's an SPA. The data has to be coming from somewhere. You can play around in devtools to reverse engineer the API requests and get the data you need. I understand companies can put roadblocks to hinder this, but my point is that browser emulation is slow and expensive resource-wise. It should be the last resort.
My £0.02

It's usually easier to use an Android emulator like GenyMotion or a rooted Android phone, together with HTTPToolkit and/or some certificate bypassing method using Frida or similar, and then explore APIs through their official apps.

I've scraped loads of stuff through unofficial APIs before this way. Most developers don't ever expect people to do this so they're often a bit less secure too.

Alternatively sometimes doing a Global GitHub / Sourcegraph search you might find someone else who's done the hard work to reverse engineer an API and open-sourced it.

Have you had any luck with FB this way? There's local history groups I'd dearly like to back up for future generations - plus posts from 6 months+ ago are already hard to get to.
Honestly haven't tried. Facebook's services are generally pretty rock solid in terms of security though, and any efforts of reverse engineering (e.g. Messenger) I've found seem to get abandoned due to the effort required.

You'd probably be best sticking with web page scraping via something like Puppeteer, but even that'd be difficult.

> Most developers don't ever expect people to do this so they're often a bit less secure too.

Yikes.

I faintly remember a story from a couple years ago where some pizza ordering app simply changed some get parameter to paid=yes after the user completed the payment process. Guess what happened when the guy who poked around the app set that parameter to yes before doing the payment step....
He went to jail?
Yeah, my first step in trying to scrape an SPA is always to hit the network tab in the browser, filter by type=JSON and then sort by size. The largest responses are often the most useful, and can then be grabbed with curl.
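
A rough sketch of the same "filter by JSON, sort by size" step done offline, assuming the network tab's traffic has been saved as a HAR file. The function names and the sample entries are made up for illustration; the keys (`log`, `entries`, `response.content.mimeType`, `response.content.size`) are the ones the HAR format uses.

```python
import json


def load_har(path):
    """Read a HAR export (it is plain JSON) from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def largest_json_responses(har, limit=5):
    """Return (size, url) pairs for JSON responses, largest first."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        mime = content.get("mimeType", "")
        if "json" not in mime.lower():
            continue  # skip HTML, scripts, images and so on
        rows.append((content.get("size", 0), entry["request"]["url"]))
    rows.sort(reverse=True)
    return rows[:limit]


# Hypothetical sample standing in for load_har("export.har").
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/app.js"},
             "response": {"content": {"mimeType": "application/javascript",
                                      "size": 91000}}},
            {"request": {"url": "https://example.com/api/products?page=1"},
             "response": {"content": {"mimeType": "application/json; charset=utf-8",
                                      "size": 48210}}},
            {"request": {"url": "https://example.com/api/session"},
             "response": {"content": {"mimeType": "application/json",
                                      "size": 312}}},
        ]
    }
}

for size, url in largest_json_responses(sample_har):
    print(f"{size:>8}  {url}")
```

The URLs at the top of that list are the ones worth replaying with curl first.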

Sometimes though that's not enough - particularly on older sites that might use weirder concepts like ASP.NET ViewState. For those I find having Playwright around is a big benefit.

Generally the things I have the most trouble with for non-browser-automation scraping are things with complex state stored in cookies and URL fragments (and maybe even localStorage these days).

I'm struggling with a site at the moment for exactly this reason: it requires an auth key to return the products.

Puppeteer had it working, but if I need to do this in a Google Firebase function, it would be so much better to get it down to simple fetch requests.

For the e-commerce sites I'm looking at though, most are just server-side rendered, which is so much easier. In that case I use `cheerio`, which has jQuery-like syntax for crawling the raw text DOM.
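
To show how little is needed for server-side rendered pages, here is a minimal sketch of the extraction step. cheerio itself is a Node library; this does the same kind of thing with only Python's standard library. The HTML snippet and the `product-name` class are invented for the example, and a real selector engine handles far more cases than this.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            if self.depth == 0:
                self.results.append("")  # start a new captured string
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Hypothetical server-rendered fragment.
SAMPLE_HTML = """
<ul>
  <li><span class="product-name">Blue <b>Widget</b></span> <span class="price">$4</span></li>
  <li><span class="product-name featured">Red Gadget</span></li>
</ul>
"""

parser = ClassTextExtractor("product-name")
parser.feed(SAMPLE_HTML)
product_names = parser.results
print(product_names)
```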

Totally agree!

I've used this method a lot. Honestly scraping Google Reviews was the most difficult in terms of complexity. This was like 6 or 7 years ago. You would get back these huge nested arrays that mostly had 0s in them. Occasionally a value would be set and that's what I would go with. I'm assuming their internal tools were obfuscated and/or using protobuf. But it certainly took me back to the good ol' days of hex-editing games in order to make your own cheat codes.

Another difficulty I faced was sites that relied on the previous UI state to pass the API call. You'd have to emulate "real" browsing by requesting the subsequent pages and getting the ID number. Still much faster than emulating the whole browser via Selenium.
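
A minimal sketch of that "request the page, pull out the ID, use it for the next call" loop. Everything here is an assumption for illustration: the `items` and `next_id` keys, the fake backend, and the function name. The fetch step is passed in as a plain function so the loop can be tried without touching a real site.

```python
def crawl_chained_pages(fetch_page, first_id, max_pages=100):
    """Follow server-issued IDs from one page to the next.

    fetch_page(page_id) must return a dict with "items" and an optional
    "next_id"; both key names are hypothetical.
    """
    items = []
    seen = set()          # guards against loops in the ID chain
    page_id = first_id
    while page_id is not None and page_id not in seen and len(seen) < max_pages:
        seen.add(page_id)
        page = fetch_page(page_id)
        items.extend(page["items"])
        page_id = page.get("next_id")  # only known after fetching this page
    return items


# Fake backend standing in for real HTTP calls.
FAKE_PAGES = {
    "a1": {"items": [1, 2], "next_id": "b7"},
    "b7": {"items": [3], "next_id": "c9"},
    "c9": {"items": [4, 5]},
}

all_items = crawl_chained_pages(FAKE_PAGES.__getitem__, "a1")
print(all_items)
```

In a real scraper `fetch_page` would be one HTTP request per call, which is still far cheaper than driving a full browser.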

Honestly, it was the small sites that actually proved more troublesome. The ones that had an actual admin reading logs. They would ban our whole IP block, then ban our whole proxy IP block. Once I implemented Tor functionality into our scraper for a particularly valuable but small site and they blocked that too. This site ended up implementing ludicrous rate limiting that had normal users waiting for 2-3 seconds between requests, all because we were scraping their data. I kid you not, by the time we gave up trying, this Section-8 rental site for a small city had vastly more protections in place than Zillow and Apartments.com combined.

I thought you were talking about Instagram :) I know a guy who has similar problems with them.
Did you ever approach them?
This was my approach too and it's been working great. Nowadays data often isn't rendered directly into HTML anymore; it gets downloaded from some JSON API endpoint. So I use network monitoring tools to see where it's coming from and then interface with the endpoint directly. I essentially wrote custom clients for someone else's site. One of my scrapers is actually just curl piped into jq. Sometimes they change the API and I have to adapt but that's fine.
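
For anyone who prefers a script over a shell pipeline, here is a small sketch of the same "curl piped into jq" idea using only the Python standard library. The helper names, the endpoint shape, and the sample payload are all hypothetical; a real endpoint may also need headers or cookies copied from the browser's network tab.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode the response body as JSON (the curl part)."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def pluck(data, *path):
    """Walk nested dicts and lists, roughly like a jq path (the jq part)."""
    for key in path:
        data = data[key]
    return data


# Hypothetical payload standing in for fetch_json("https://example.com/api/products").
payload = json.loads("""
{"data": {"items": [
    {"name": "Blue Widget", "price": 4.0},
    {"name": "Red Gadget", "price": 7.5}
]}}
""")

# Comparable to: jq '.data.items[].name'
names = [item["name"] for item in pluck(payload, "data", "items")]
print(names)
```

When the site changes its API, usually only the path passed to `pluck` needs updating.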

> I understand companies can put roadblocks to hinder this

Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

> Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

Cloudflare Bot Protection[1] is a popular one. The website is guarded by a layer of code that needs to be executed before continuing. Normal browsers will follow through. It can be hard to bypass.

[1]: https://www.cloudflare.com/pg-lp/bot-mitigation-fight-mode/

I have a codebase that defeats Cloudflare protection. Felt like I had the keys to the kingdom.
So that would break text browsers too, right? :(

And users with JS disabled for privacy reasons.

"It's increasingly difficult these days to write scrapers that don't at some point need to execute Javascript on a page - so you need to have a good browser automation tool on hand."

But doesn't this assume which sites are being "scraped"? How would anyone know what sites someone else needs to "scrape" unless people name the sites (and the specific pages at the sites, as this is not "crawling")? For example, none of the websites with webpages I extract data from require me to use Javascript, i.e., I can retrieve and extract data without using JS.

Also, it is possible to automate text-only browsers that do not run Javascript. "Browser automation" is not necessarily just for Javascript.

Maybe we should have a "scraping challenge" in an effort to provide some evidence on this question. The challenge could be to "scrape" every webpage currently submitted to HN,^1 without using Javascript.^2

If someone manages to scrape a majority of the pages submitted to HN without JS, then we have some evidence that, for HN readers, JS and therefore Javascript-enabled browser automation is generally _not_ required for "scraping".

1. The problem I see with using something more generic like majestic_million.csv is that it is a list of domain names, not webpages.

2. We would likely need to agree on what data would need to be extracted from each submitted page.

I'll rephrase:

"It's increasingly difficult these days to regularly write scrapers for a large range of different websites without eventually hitting a situation where you need to execute JavaScript on a page"

What happens in Playwright with sites using CAPTCHAs?

I had been occasionally scraping a site via curl, but then they started using Cloudflare's anti-bot stuff.

I switched to Selenium and that worked for a while--my Selenium script would navigate to the site, pause to let me manually deal with Cloudflare, and then automatically grab the data I wanted. But then that stopped working.

I found a Stack Overflow answer that gave some settings in Selenium to make it not tell the site's JavaScript that the browser was being automated and that briefly made things happy, but not too long afterwards that broke. There's a Selenium Chrome driver available that is meant for scraping which apparently tries to hide all evidence that the browser is being automated, but it didn't fool Cloudflare.

What I want is a browser-based automation tool that to the site is indistinguishable from a human browsing, except possibly by the timing of user actions. E.g., if the site can deduce it is being automated because the client responds faster than human reaction time, or with too little variation in response time, that's fine.

Totally separate question, but I'm wondering why you put 'Mar' in your url instead of the month number?
It's a decision from 2003 I think. It's mainly because I'm from the UK, so I'm extremely sensitive to the risk of people confusing DD-MM-YYYY and MM-DD-YYYY - the least ambiguous format is to use DD-Mon-YYYY, so I picked that for my URLs.
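
A quick illustration of the ambiguity being described, plus a month-name path in the style of the URL quoted earlier in the thread. This is only a sketch: `month_name_path` and the hard-coded month table are assumptions for the example, not the blog's actual code, and leaving the day unpadded is an arbitrary choice here.

```python
from datetime import date, datetime

text = "03-04-2022"

# The same string parses cleanly under both conventions, with different results.
as_day_first = datetime.strptime(text, "%d-%m-%Y").date()    # 3 April 2022
as_month_first = datetime.strptime(text, "%m-%d-%Y").date()  # 4 March 2022
print(as_day_first.isoformat(), as_month_first.isoformat())

# Hard-coded abbreviations avoid depending on the machine's locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def month_name_path(d):
    """Build a /YYYY/Mon/D/ style path, where the month cannot be misread."""
    return f"/{d.year}/{MONTHS[d.month - 1]}/{d.day}/"


print(month_name_path(date(2022, 3, 14)))
```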

If I was designing my blog today I'd probably drop the day and month entirely, and go with /yyyy/unique-text-slug for the URLs.

ISO 8601 is least ambiguous, as there's no question of "endian-ness".

https://en.wikipedia.org/wiki/ISO_8601

YYYY-MM-DD

Yes, with year last, most of the world does it one way, but most of the audience are often from a part of the world that (a) thinks it's most of the world, and (b) does it the wrong way by shuffling endian-ness.

The killer feature of ISO-8601 style dates is that you don't need to parse the dates to sort them. Lexicographical order is chronological order. That's a pretty huge deal.
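
A small sketch of that property with made-up dates. It holds as long as every string uses the same zero-padded YYYY-MM-DD layout with four-digit years; the second half shows the same dates in DD-MM-YYYY, where plain text sorting does not give chronological order.

```python
from datetime import date, datetime

iso_stamps = ["2022-03-14", "2003-11-02", "2022-01-30", "2021-12-31"]

# Sort as plain strings, with no date parsing at all.
sorted_as_text = sorted(iso_stamps)

# Sort as real dates, then turn back into strings for comparison.
sorted_as_dates = [d.isoformat()
                   for d in sorted(date.fromisoformat(s) for s in iso_stamps)]

print(sorted_as_text == sorted_as_dates)

# The same four dates written day-first.
dmy_stamps = ["14-03-2022", "02-11-2003", "30-01-2022", "31-12-2021"]
dmy_text_order = sorted(dmy_stamps)
dmy_true_order = sorted(dmy_stamps,
                        key=lambda s: datetime.strptime(s, "%d-%m-%Y"))

print(dmy_text_order == dmy_true_order)
```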
This. Two additional benefits of this approach are it sorts correctly and it's already standard in China, which is effectively a whole heap of the world's internet population.
> the least ambiguous format is to use DD-Mon-YYYY

I consider YYYY-MM-DD to be the least ambiguous. But now that I look at the above, I guess the author is saying that since MM-DD could possibly be considered as DD-MM? Yuck.

The least ambiguous is ISO 8601, since it is standardised, but I agree that an alpha month is unambiguous.
Thanks simon! Makes perfect sense.