Hacker News
by mnmkng 2059 days ago
Actually, if you're scraping at any scale above a hobby project, most of your web scraping hours would now be spent on avoiding bot detection, reverse engineering APIs and trying to make HTTP requests work where it seems only a browser can help. The time spent "working with strings" is not even noticeable to me.

I scrape for a living and I work with JS, because currently, it has the better tools.

2 comments

I can echo this.

I'm currently working to turn my hobby scraper into something profitable. "Working with strings" is already the least of my concerns. I've spent most of my time finding an architecture / file structure that allows me to

- easily handle markup changes on source-pages and

- quickly integrate new sources

I had feared it would be impossible to handle unexpected structural changes across a multitude of sources. Turns out that rarely happens. Like, once every x years per page type per source.
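
For what it's worth, here is a rough sketch of the kind of structure I mean, not my actual code. Everything in it is made up for illustration: the `sources` registry, the `example-shop` source, the `product` page type and the `extract` helper are all hypothetical. To keep it dependency-free, plain regexes stand in for the CSS selectors you would normally hand to cheerio or jsdom. The idea is that each source is just a data entry, so integrating a new source means adding an entry, and a markup change means editing one pattern.

```javascript
// Hypothetical registry: one entry per source, one set of field patterns per page type.
// The regexes are stand-ins for the CSS selectors you would use with cheerio/jsdom.
const sources = {
  "example-shop": {
    product: {
      title: /<h1 class="title">([^<]*)<\/h1>/,
      price: /<span class="price">([^<]*)<\/span>/,
    },
  },
};

// Generic extraction shared by every source; it does not change when markup does.
function extract(sourceName, pageType, html) {
  const fields = sources[sourceName]?.[pageType];
  if (!fields) {
    throw new Error(`Unknown source/page type: ${sourceName}/${pageType}`);
  }
  const record = {};
  const missing = [];
  for (const [name, pattern] of Object.entries(fields)) {
    const match = pattern.exec(html);
    if (match) {
      record[name] = match[1].trim();
    } else {
      // A field that no longer matches hints that the source changed its markup.
      missing.push(name);
    }
  }
  return { record, missing };
}

// Usage with a made-up page.
const html = '<h1 class="title"> Blue Mug </h1><span class="price">9.50</span>';
const result = extract("example-shop", "product", html);
console.log(result.record, result.missing);
```

Returning the `missing` list instead of throwing means a scheduled run can log which source and field broke and carry on with the other sources.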

I'm not sure if Puppeteer/Playwright, with real Chrome, might be a better option for some of those instances.

I prefer JS + jsdom/cheerio as it's closer to the in-browser experience for scraping.