HN Mirror



	by mnmkng 2059 days ago
	Actually, if you're scraping at any scale above a hobby project, most of your web scraping hours are now spent on avoiding bot detection, reverse engineering APIs, and trying to make plain HTTP requests work where it seems only a browser can help. The time spent "working with strings" is not even noticeable to me. I scrape for a living and I work with JS because, currently, it has the better tools.

2 comments

woodpanel 2059 days ago

I can echo this.

I'm currently working to turn my hobby scraper into something profitable. "Working with strings" is already the least of my concerns. I've spent most of my time finding an architecture / file structure that allows me to

- easily handle markup changes on source-pages and

- quickly integrate new sources

I had feared it would be impossible to handle unexpected structural changes from a multitude of sources. Turns out that rarely happens. Like, once every x years per page type of a given source.
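
For illustration only, here is a minimal sketch of one way such a structure could look; it is not the layout woodpanel actually uses. It assumes one adapter per source, so that a markup change touches a single file and a new source is one more registration. The names `Item`, `SourceAdapter`, `registerAdapter`, `parsePage` and the site `news.example.com` are all made up, and the regex stands in for a real selector query (jsdom, cheerio) to keep the sketch free of dependencies.

```typescript
// Hypothetical record shape; field names are illustrative, not from the thread.
interface Item {
  title: string;
  url: string;
}

// Everything that knows about one site's markup lives behind this interface.
interface SourceAdapter {
  name: string;
  matches(url: string): boolean;
  parse(html: string): Item[];
}

const adapters: SourceAdapter[] = [];

// Integrating a new source means writing one adapter and registering it.
function registerAdapter(adapter: SourceAdapter): void {
  adapters.push(adapter);
}

// Route a fetched page to the first adapter that claims its URL.
function parsePage(url: string, html: string): Item[] {
  for (const adapter of adapters) {
    if (adapter.matches(url)) {
      return adapter.parse(html);
    }
  }
  throw new Error("No adapter registered for " + url);
}

// Example adapter for a made-up site. The regex is a stand-in for a real
// selector query and only handles this exact, assumed markup.
const exampleNews: SourceAdapter = {
  name: "example-news",
  matches: (url) => url.indexOf("https://news.example.com/") === 0,
  parse: (html) => {
    const items: Item[] = [];
    // If the site changes its markup, only this pattern needs updating.
    const link = /<a class="story" href="([^"]+)">([^<]+)<\/a>/g;
    let m: RegExpExecArray | null;
    while ((m = link.exec(html)) !== null) {
      items.push({ url: m[1], title: m[2].trim() });
    }
    return items;
  },
};

registerAdapter(exampleNews);
```

The rest of the pipeline (fetching, storage, scheduling) would only ever call `parsePage`, so it never sees site-specific markup.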

tracker1 2059 days ago

I'm not sure if Puppeteer/Playwright, with real Chrome, might be a better option for some of those instances.

I prefer JS + jsdom/cheerio as it's closer to the in-browser experience for scraping.