HN Mirror


by simonw 1409 days ago

Yeah, my first step in trying to scrape an SPA is always to hit the network tab in the browser, filter by type=JSON and then sort by size. The largest responses are often the most useful, and can then be grabbed with curl.
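A rough sketch of the same "filter by JSON, sort by size" step done offline, assuming the network tab's contents were exported as a HAR file. The helper name `largest_json_responses` and the sample URLs are made up for illustration; the field names follow the HAR 1.2 layout.

```python
def largest_json_responses(har, top=5):
    """Return (size_in_bytes, url) pairs for JSON responses, largest first."""
    rows = []
    for entry in har["log"]["entries"]:
        content = entry["response"]["content"]
        # Match "application/json" as well as variants like "application/ld+json".
        if "json" in content.get("mimeType", "").lower():
            rows.append((content.get("size", 0), entry["request"]["url"]))
    return sorted(rows, reverse=True)[:top]


# Tiny hand-written HAR fragment; a real one would come from
# json.load(open("export.har")) after saving it from the browser.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/api/products"},
                "response": {"content": {"mimeType": "application/json", "size": 48213}},
            },
            {
                "request": {"url": "https://example.com/app.js"},
                "response": {"content": {"mimeType": "text/javascript", "size": 912004}},
            },
            {
                "request": {"url": "https://example.com/api/session"},
                "response": {"content": {"mimeType": "application/json; charset=utf-8", "size": 310}},
            },
        ]
    }
}

if __name__ == "__main__":
    for size, url in largest_json_responses(SAMPLE_HAR):
        print(size, url)
```

The URLs at the top of that list are the ones worth trying with curl first.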

Sometimes, though, that's not enough - particularly on older sites that use weirder mechanisms like ASP.NET ViewState. For those I find having Playwright around is a big benefit.
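For context on why ViewState is awkward without a browser: Web Forms pages carry state in hidden inputs such as `__VIEWSTATE` and `__EVENTVALIDATION`, and each postback has to send them back. A minimal sketch of that round trip, assuming a simple page; the sample HTML, the field values, and the control name `ctl00$Next` are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_postback(html, event_target):
    """Build a form-encoded body that echoes the hidden state back."""
    fields = hidden_fields(html)
    fields["__EVENTTARGET"] = event_target  # the control that "was clicked"
    return urlencode(fields)


SAMPLE_PAGE = """
<form method="post" action="list.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgK" />
  <input type="text" name="q" value="shoes" />
</form>
"""

if __name__ == "__main__":
    print(build_postback(SAMPLE_PAGE, "ctl00$Next"))
```

On real sites the hidden values change with every response, so each request has to re-parse the previous page, which is the point where a real browser starts to look attractive.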

Generally the things I have the most trouble with for non-browser-automation scraping are things with complex state stored in cookies and URL fragments (and maybe even localStorage these days).
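To make the cookie and fragment cases concrete, here is a small sketch under simple assumptions: the cookie state can be replayed by echoing `Set-Cookie` values back in a `Cookie` header, and the fragment has to be handled separately because it is never sent to the server. The helper names and sample values are hypothetical.

```python
from http.cookies import SimpleCookie
from urllib.parse import urldefrag


def cookie_header(set_cookie_lines):
    """Turn Set-Cookie header values into a single Cookie header value."""
    jar = SimpleCookie()
    for line in set_cookie_lines:
        jar.load(line)
    # Attributes such as Path and HttpOnly are dropped; only name=value is sent back.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


def split_fragment(url):
    """Separate what the server sees from the client-only fragment."""
    result = urldefrag(url)
    return result.url, result.fragment


if __name__ == "__main__":
    print(cookie_header(["session=abc123; Path=/; HttpOnly", "cart=42; Path=/"]))
    print(split_fragment("https://example.com/shop?page=2#/category/shoes"))
```

This ignores expiry, domains, and anything the page's JavaScript writes into cookies or localStorage after load, which is usually where the real trouble is.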

1 comment

real-dino 1409 days ago

I'm struggling with a site at the moment for exactly this reason: it requires an auth key to return the products.

Puppeteer had it working, but if I need to do this in a Google Firebase function, it would be so much better to get it down to simple fetch requests.
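A sketch of what the "simple request" version might look like once the auth key is known. This is Python's `urllib` standing in for JavaScript `fetch`. The `/products` path and the `Authorization: Bearer` scheme are assumptions; the actual site may expect the key in a different header, a cookie, or a query parameter.

```python
import json
from urllib.request import Request, urlopen


def build_products_request(base_url, auth_key):
    """Build (but do not send) a GET request carrying the auth key."""
    return Request(
        f"{base_url}/products",  # hypothetical endpoint path
        headers={
            "Authorization": f"Bearer {auth_key}",  # assumed header scheme
            "Accept": "application/json",
        },
    )


def fetch_products(base_url, auth_key):
    """Send the request and decode the JSON body."""
    request = build_products_request(base_url, auth_key)
    with urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    req = build_products_request("https://example.com/api", "secret")
    print(req.get_method(), req.full_url, req.get_header("Authorization"))
```

The hard part is usually finding out how the page obtains the key in the first place; the network tab normally shows which earlier response or script supplies it.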

For the e-commerce sites I'm looking at, though, most are just server-side rendered, which is so much easier. In that case I use `cheerio`, which has jQuery-like syntax for querying the parsed HTML.
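For the server-side rendered case, a minimal sketch of pulling data straight out of the HTML. This uses Python's built-in `html.parser` as a stand-in for `cheerio`, not cheerio itself, and the `h2.product-title` markup is an invented example.

```python
from html.parser import HTMLParser


class ProductTitleParser(HTMLParser):
    """Collect the text of every <h2 class="product-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "product-title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


def product_titles(html):
    parser = ProductTitleParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]


SAMPLE_HTML = """
<div class="product"><h2 class="product-title"> Red Shoes </h2><span>$40</span></div>
<div class="product"><h2 class="product-title featured">Blue Hat</h2><span>$15</span></div>
<h2 class="sidebar">Related</h2>
"""

if __name__ == "__main__":
    print(product_titles(SAMPLE_HTML))
```

With cheerio the same idea is a single CSS selector; the hand-written parser here only shows that no browser is needed when the data is already in the markup.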
