HN Mirror

by 1vuio0pswjnm7 1411 days ago

"It's increasingly difficult these days to write scrapers that don't at some point need to execute Javascript on a page - so you need to have a good browser automation tool on hand."

But doesn't this assume which sites are being "scraped"? How would anyone know what sites someone else needs to "scrape" unless people name the sites (and the specific pages at those sites, since this is not "crawling")? For example, none of the websites I extract data from require Javascript, i.e., I can retrieve the pages and extract the data without running JS.

Also, it is possible to automate text-only browsers that do not run Javascript. "Browser automation" is not necessarily just for Javascript.

Maybe we should have a "scraping challenge" in an effort to provide some evidence on this question. The challenge could be to "scrape" every webpage currently submitted to HN,^1 without using Javascript.^2

If someone manages to scrape a majority of the pages submitted to HN without JS, then we have some evidence that, for HN readers, JS, and therefore Javascript-enabled browser automation, is generally _not_ required for "scraping".

1. The problem I see with using something more generic like majestic_million.csv is that it is a list of domain names, not webpages.

2. We would likely need to agree on what data would need to be extracted from each submitted page.
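
To make the challenge concrete, here is a rough sketch of how one might score it, using only the Python standard library and no Javascript engine. Everything in it is an assumption for illustration: the names `extract_title`, `fetch`, and `success_rate` are made up, the User-Agent string is arbitrary, and "the page has a non-empty `<title>`" is only a stand-in for whatever data we would actually agree to extract (see note 2). A page can have a title in its raw HTML and still need JS for its body, so this would overcount.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in raw HTML."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace and trim the ends.
        return " ".join("".join(self._parts).split())


def extract_title(html):
    """Return the page title found in the HTML as served, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title


def fetch(url, timeout=10):
    """Plain HTTP GET; no scripts are executed. The User-Agent is arbitrary."""
    req = Request(url, headers={"User-Agent": "no-js-challenge/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def success_rate(urls, fetcher=fetch):
    """Fraction of URLs whose raw HTML yields a non-empty title."""
    ok = 0
    for url in urls:
        try:
            if extract_title(fetcher(url)):
                ok += 1
        except Exception:
            # Network errors and the like count as failures in this sketch.
            pass
    return ok / len(urls) if urls else 0.0
```

Feeding `success_rate` the list of currently submitted URLs would give one number to argue about. The `fetcher` parameter is there so the scoring logic can be tried against canned HTML without touching the network.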

1 comment

simonw 1411 days ago

I'll rephrase:

"It's increasingly difficult these days to regularly write scrapers for a large range of different websites without eventually hitting a situation where you need to execute JavaScript on a page"
