by shubhamjain 1409 days ago
My $0.02, but in most cases I have seen, you don't need to emulate a browser to scrape, even if it's an SPA. The data has to be coming from somewhere. You can play around in devtools to reverse engineer the API requests and get the data you need. I understand companies can put up roadblocks to hinder this, but my point is that browser emulation is slow and expensive resource-wise. It should be the last resort.

My £0.02

It's usually easier to use an Android emulator like GenyMotion or a rooted Android phone, intercept traffic with HTTPToolkit (plus a certificate pinning bypass via Frida or similar if needed), and then explore the APIs through the official apps.

I've scraped loads of stuff through unofficial APIs before this way. Most developers don't ever expect people to do this so they're often a bit less secure too.

Alternatively, a global GitHub / Sourcegraph search sometimes turns up someone else who has done the hard work of reverse engineering an API and open-sourced it.

Have you had any luck with FB this way? There are local history groups I'd dearly like to back up for future generations - plus posts from 6+ months ago are already hard to get to.
Honestly haven't tried. Facebook's services are generally pretty rock solid in terms of security though, and any efforts of reverse engineering (e.g. Messenger) I've found seem to get abandoned due to the effort required.

You'd probably be best sticking with web page scraping via something like Puppeteer, but even that'd be difficult.

> Most developers don't ever expect people to do this so they're often a bit less secure too.

Yikes.

I faintly remember a story from a couple of years ago where some pizza ordering app simply changed a GET parameter to paid=yes after the user completed the payment process. Guess what happened when the guy who poked around the app set that parameter to yes before doing the payment step...
He went to jail?
Yeah, my first step in trying to scrape an SPA is always to hit the network tab in the browser, filter by type=JSON and then sort by size. The largest responses are often the most useful, and can then be grabbed with curl.
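The "filter by JSON, sort by size" step can also be scripted against a HAR file exported from the network tab. A minimal sketch, assuming the standard HAR layout (`log.entries[].request.url` and `log.entries[].response.content`); the function name is made up for illustration:

```python
def largest_json_responses(har: dict, limit: int = 5) -> list[tuple[int, str]]:
    """Return (size_in_bytes, url) pairs for JSON responses, largest first."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        # mimeType may carry a suffix, e.g. "application/json; charset=utf-8"
        if "json" not in content.get("mimeType", "").lower():
            continue
        rows.append((content.get("size", 0), entry["request"]["url"]))
    # Sort on size only, so ties keep their original capture order
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:limit]
```

Load the export with `json.load` and pass the resulting dict in; the URLs at the top of the list are the ones worth trying with curl first.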

Sometimes, though, that's not enough - particularly on older sites that use weirder concepts like ASP.NET ViewState. For those I find having Playwright around is a big benefit.

Generally the things I have the most trouble with for non-browser-automation scraping are things with complex state stored in cookies and URL fragments (and maybe even localStorage these days).

I'm struggling with a site at the moment for exactly this reason: it requires an auth key to return the products.

Puppeteer had it working, but if I need to do this in a Google Firebase function, it would be so much better to get it down to simple fetch requests.

For the e-commerce sites I'm looking at, though, most are just server-side rendered, which is so much easier. In that case I use `cheerio`, which has jQuery-like syntax for traversing the parsed HTML.
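For anyone not on Node, the same idea (parse the server-rendered HTML, pull text out by class) can be sketched in Python with only the standard library. This is not cheerio and has none of its selector support; the class name `ClassTextExtractor` and the helper `extract_text` are invented for this example:

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.depth = 0                  # > 0 while inside a matching element
        self.results: list[str] = []
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1             # nested tag inside a match
        elif self.css_class in classes:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:             # the matching element just closed
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)

def extract_text(html: str, css_class: str) -> list[str]:
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.results
```

It assumes reasonably well-formed markup; for messy real-world pages a proper parser library is the safer choice.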

Totally agree!

I've done this method a lot. Honestly scraping Google Reviews was the most difficult in terms of complexity. This was like 6 or 7 years ago. You would get back these huge nested arrays that mostly had 0s in them. Occasionally a value would be set and that's what I would go with. I'm assuming their internal tools were obfuscated and/or using protobuf. But it certainly took me back to the good ol' days hexediting games in order to make your own cheat codes.

Another difficulty I faced was sites that relied on the previous UI state to build the next API call. You'd have to emulate "real" browsing by requesting the pages in order and getting the ID number from each response. Still much faster than emulating the whole browser via Selenium.
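That "each response hands you the ID for the next request" pattern looks roughly like the sketch below. The field names `items` and `next_id` are hypothetical, and `fetch` stands in for whatever HTTP call the site actually needs:

```python
from typing import Callable, Iterator, Optional

def walk_pages(fetch: Callable[[Optional[str]], dict],
               max_pages: int = 100) -> Iterator[dict]:
    """Yield items page by page, feeding each response's ID into the next request."""
    page_id: Optional[str] = None       # the first request carries no ID
    for _ in range(max_pages):          # hard cap so a bad ID cannot loop forever
        page = fetch(page_id)
        yield from page.get("items", [])
        page_id = page.get("next_id")   # hypothetical field name
        if page_id is None:
            break
```

Passing `fetch` in as a parameter keeps the walking logic separate from the request details, so headers and delays can change without touching the loop.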

Honestly, it was the small sites that actually proved more troublesome. The ones that had an actual admin reading logs. They would ban our whole IP block, then ban our whole proxy IP block. Once I implemented Tor functionality into our scraper for a particularly valuable but small site and they blocked that too. This site ended up implementing ludicrous rate limiting that had normal users waiting for 2-3 seconds between requests, all because we were scraping their data. I kid you not, by the time we gave up trying, this Section-8 rental site for a small city had vastly more protections in place than Zillow and Apartments.com combined.

I thought you were talking about Instagram :) I know a guy who has similar problems with them.
Did you ever approach them?
This was my approach too and it's been working great. Nowadays data isn't rendered directly into HTML anymore, it gets downloaded from some JSON API endpoint. So I use network monitoring tools to see where it's coming from and then interface with the endpoint directly. I essentially wrote custom clients for someone else's site. One of my scrapers is actually just curl piped into jq. Sometimes they change the API and I have to adapt, but that's fine.
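For comparison, the curl-piped-into-jq idea translates to a few lines of Python. A rough sketch using only the standard library; `fetch_json` and `pluck` are names invented here, and the key and field names are placeholders for whatever the real endpoint returns:

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url: str, opener=urlopen) -> dict:
    """GET a URL and decode the JSON body (the curl half)."""
    request = Request(url, headers={"Accept": "application/json"})
    with opener(request) as response:
        return json.load(response)

def pluck(payload: dict, list_key: str, fields: list[str]) -> list[dict]:
    """Keep only the chosen fields of each item (the jq half)."""
    return [
        {field: item.get(field) for field in fields}
        for item in payload.get(list_key, [])
    ]
```

The `opener` parameter exists only so the network call can be swapped out when testing; when the API changes, usually only the arguments to `pluck` need adjusting.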

> I understand companies can put roadblocks to hinder this

Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

> Can you elaborate? I haven't run into any roadblocks yet but I'm not scraping big sites or sending a massive number of requests.

Cloudflare Bot Protection[1] is a popular one. The website is guarded by a layer of code that needs to be executed before continuing. Normal browsers will follow through. It can be hard to bypass.

[1]: https://www.cloudflare.com/pg-lp/bot-mitigation-fight-mode/

I have a codebase that defeats Cloudflare protection. Felt like I had the keys to the kingdom.
So that would break text browsers too, right? :(

And users with JS disabled for privacy reasons.