by strin 1956 days ago
That works only for static pages, though. Many modern pages require you to run Selenium or Puppeteer to scrape the content.

For these sites, I crawl using a JS-powered engine and just save the relevant page content to disk.

Then I can craft my regex/selectors/etc., once I have the data stored locally.

This helps if you get caught and shut down: it won't halt your development effort, and you can create a separate task to proxy requests.
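A minimal sketch of that save-first, parse-later workflow, using only the Python standard library. The function names, the hashed file naming, and the `<h2 class="title">` pattern are all assumptions for illustration; the rendered HTML would come from whatever JS-capable crawler you already run.

```python
import hashlib
import re
from pathlib import Path
from typing import List, Optional


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def save_page(cache_dir: Path, url: str, html: str) -> Path:
    """Store rendered HTML on disk so it never has to be fetched again."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(cache_dir, url)
    path.write_text(html, encoding="utf-8")
    return path


def load_page(cache_dir: Path, url: str) -> Optional[str]:
    """Return the cached HTML for url, or None if it was never saved."""
    path = cache_path(cache_dir, url)
    return path.read_text(encoding="utf-8") if path.exists() else None


def extract_titles(html: str) -> List[str]:
    """Hypothetical extraction step: iterate on this offline against the cache."""
    return re.findall(r'<h2 class="title">(.*?)</h2>', html, flags=re.S)
```

The crawl step calls `save_page` once per URL; after that you can rework `extract_titles` as often as you like against `load_page` without sending another request.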

I did web scraping professionally for two years, on the order of 10M pages per day. Performance with a browser is abysmal and it requires tonnes of memory, so it's not financially viable. We used browsers for some jobs, but rendered content isn't a problem: you can simulate the API calls (common) and read the JSON, or regex the script and try to do something with that.

I'd say 99% of the time you can get by without a browser.
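A rough sketch of the "regex the script" route, assuming the page embeds its data as a JSON object assigned to a global inside a `<script>` tag. The variable name `window.__INITIAL_STATE__` is a hypothetical example; real sites use their own names, and the pattern would need adjusting per site.

```python
import json
import re
from typing import Any, Optional

# Assumed shape: <script>window.__INITIAL_STATE__ = {...};</script>
# The variable name is hypothetical and differs from site to site.
STATE_RE = re.compile(
    r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;\s*</script>",
    re.S,
)


def extract_embedded_state(html: str) -> Optional[Any]:
    """Pull the embedded JSON blob out of raw HTML without running any JS."""
    match = STATE_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))
```

When the data comes from a separate API call instead, the same idea applies: request that URL directly and pass the response body to `json.loads`.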

Fully agree. It takes some thought :)
That's never required; the data shows up in the web page because you requested it from somewhere. You can do the same thing in your scraper.
> You can do the same thing in your scraper

Rendering the page in Puppeteer / Selenium and then scraping it from there sounds a lot easier than somehow trying to replicate that in your scraper?

Sure. How does that relate to the claim that your scraper is actually unable to make the same requests your browser does?
How are you going to deal with values generated by JS and used to sign requests?
If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do, since it's applying a security feature (signatures) in a way that prevents it from providing any security.

If they're generated server-side like you would expect, and sent to the client, you'd get them the same way you get anything else, by asking for them.
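To make the client-side case concrete, here is a sketch under a purely hypothetical scheme: the site's JS signs the sorted query parameters with HMAC-SHA256 using a key that ships in the page, and sends the result as a `sig` parameter. The scheme, the key, and the parameter name are all assumptions; a real site's algorithm would have to be read out of its own script.

```python
import hashlib
import hmac
from typing import Dict
from urllib.parse import urlencode


def sign_params(params: Dict[str, str], key: bytes) -> str:
    """Recompute the (assumed) client-side signature outside the browser."""
    # Sort so the signature does not depend on parameter order.
    canonical = urlencode(sorted(params.items()))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def signed_query(params: Dict[str, str], key: bytes) -> str:
    """Build the query string with the hypothetical `sig` parameter appended."""
    signature = sign_params(params, key)
    return urlencode(sorted(params.items()) + [("sig", signature)])
```

Since the key has to reach the browser for the page to work at all, the scraper can read it the same way, which is the point made above about this kind of signature adding no real security.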

I'm not sure what your point is. Of course you can replicate every request in your scraper or with curl, provided you know all the input variables.

Doing that for web scraping purposes where everything is changing all the time and you have more than one target website is just not feasible if you have to reverse engineer some custom JS for every site. Using some kind of headless browser for modern websites will be way easier and more reliable.

>If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do

what??

Page loads -> JavaScript sends request to backend -> it returns data -> JavaScript does stuff with it and renders it.