| HN Mirror

	by strin 1956 days ago
	That works only for static pages, though. Many modern pages require you to run Selenium or Puppeteer to scrape the content.

3 comments

edmundsauto 1956 days ago

For these sites, I crawl using a JS-powered engine and just save the relevant page content to disk.

Then I can craft my regex/selectors/etc., once I have the data stored locally.

This helps if you get caught and shut down: your development effort doesn't stop, because you can keep working against the saved copies, and you can create a separate task to proxy requests.
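
The save-first workflow described above could be sketched roughly as follows. This is only a minimal illustration, not the commenter's actual setup: the function names `save_page` and `load_page`, the hashed file naming, and the cache directory layout are all assumptions, and the rendered HTML is taken as a plain string from whatever JS-capable crawler produced it.

```python
import hashlib
from pathlib import Path
from typing import Optional


def _cache_path(cache_dir: Path, url: str) -> Path:
    # Hash the URL so that any URL maps to a safe, stable file name.
    return cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")


def save_page(cache_dir: Path, url: str, html: str) -> Path:
    """Write the rendered HTML for `url` to disk and return the file path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = _cache_path(cache_dir, url)
    path.write_text(html, encoding="utf-8")
    return path


def load_page(cache_dir: Path, url: str) -> Optional[str]:
    """Return the saved HTML for `url`, or None if it was never crawled."""
    path = _cache_path(cache_dir, url)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

With the pages on disk, regexes and selectors can be developed against `load_page(...)` output without sending any further requests to the site.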

alephu5 1956 days ago

I did web scraping professionally for two years, on the order of 10M pages per day. Performance with a browser is abysmal and it requires tonnes of memory, so it isn't financially viable. We used browsers for some jobs, but rendered content isn't a problem: you can simulate the API calls (the common case) and read the JSON, or regex the script and try to do something with that.

I'd say 99% of the time you can get by without a browser.
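
As a rough sketch of the "regex the script" approach mentioned above: many pages embed their data as a JSON literal assigned to a variable inside a script tag. The variable name `window.__INITIAL_STATE__` and the function `extract_embedded_state` are assumptions for illustration; a real site will use its own names and may need a different pattern.

```python
import json
import re
from typing import Any, Optional

# Hypothetical marker; real sites use their own variable names.
STATE_MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def extract_embedded_state(html: str) -> Optional[Any]:
    """Return the JSON value assigned to the marker variable, or None."""
    match = STATE_MARKER.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever text follows it (such as ";</script>").
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value
```

The regex only locates the assignment; the JSON decoder finds where the value ends, which avoids guessing at closing braces with a pattern.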

inovica 1955 days ago

Fully agree. It takes some thought :)

thaumasiotes 1956 days ago

That's never required; the data shows up in the web page because you requested it from somewhere. You can do the same thing in your scraper.

dewey 1956 days ago

> You can do the same thing in your scraper

Rendering the page in Puppeteer / Selenium and then scraping it from there sounds a lot easier than somehow trying to replicate that in your scraper?

thaumasiotes 1956 days ago

Sure. How does that relate to the claim that your scraper is actually unable to make the same requests your browser does?

dewey 1956 days ago

How are you going to deal with values generated by JS and used to sign requests?

thaumasiotes 1956 days ago

If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do, since it's applying a security feature (signatures) in a way that prevents it from providing any security.

If they're generated server-side like you would expect, and sent to the client, you'd get them the same way you get anything else, by asking for them.
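
To make the client-side case concrete, here is a minimal sketch of reproducing a signature outside the browser. Everything about the scheme is an assumption made for illustration: an HMAC-SHA256 over the sorted, URL-encoded parameters, a key already recovered from the page's JS, and a `sig` query parameter. A real site's scheme would have to be read out of its own script.

```python
import hashlib
import hmac
from typing import Mapping
from urllib.parse import urlencode


def sign_params(params: Mapping[str, str], key: bytes) -> str:
    """Return a hex HMAC-SHA256 over the canonical (sorted) query string."""
    # Sorting makes the signature independent of parameter order.
    canonical = urlencode(sorted(params.items()))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def signed_query(params: Mapping[str, str], key: bytes) -> str:
    """Return the query string with the hypothetical `sig` parameter appended."""
    signature = sign_params(params, key)
    return urlencode(sorted(params.items()) + [("sig", signature)])
```

Whatever the browser computes from inputs it was sent, a scraper holding the same inputs can compute as well; the work lies in finding out what the inputs and the algorithm are.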

dewey 1956 days ago

I'm not sure what your point is. Of course you can replicate every request in your scraper or with curl if you know all the input variables.

Doing that for web scraping, where everything changes all the time and you have more than one target website, is just not feasible if you have to reverse-engineer some custom JS for every site. Using some kind of headless browser for modern websites will be way easier and more reliable.

tester756 1956 days ago

>If they're really being generated client-side, you're free to generate them yourself by any means you want. But also, that's a strange thing for the website to do

what??

Page loads -> JavaScript sends request to backend -> backend returns data -> JavaScript does stuff with it and renders it.
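
For the flow above, the middle step (the request to the backend) is what a scraper can issue directly instead of rendering the page. A minimal sketch, assuming a JSON endpoint: the endpoint URL, the parameter names, and the helper names `build_api_request` and `fetch_json` are hypothetical, and the `opener` argument exists only so the network call can be swapped out.

```python
import json
from typing import Any, Callable, Mapping
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_request(base_url: str, params: Mapping[str, str]) -> Request:
    """Build the same kind of GET request the page's JavaScript would send."""
    url = base_url + "?" + urlencode(params)
    return Request(url, headers={"Accept": "application/json"})


def fetch_json(request: Request, opener: Callable[[Request], Any] = urlopen) -> Any:
    """Send the request and decode the JSON body of the response."""
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

In practice the endpoint and its parameters would be read from the browser's network panel while the page loads.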
