by gitgud 2059 days ago
A great web-scraping architecture is the pipeline model, similar to a 3D rendering pipeline. |Stage 1|: Render the HTML, |Stage 2|: Save the HTML to disk, |Stage 3|: Parse and translate the HTML into whatever output you need (JSON, CSV, etc.).

It's great if each of these stages can be invoked separately, so that once the HTML is saved you don't need to download it again unless the source has changed.
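
A rough Python sketch of stages that can be run separately, under a few assumptions of mine: the function names (`fetch_to_disk`, `parse_saved_html`), the hashed file names, and the `refresh` flag are all made up for illustration, and the "render" step here is just a plain HTTP GET rather than a real browser render. Pulling out the `<title>` stands in for whatever parsing you actually need.

```python
import hashlib
import json
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


def cache_path(url: str, cache_dir: Path) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def download(url: str) -> str:
    """Stage 1 (simplified): plain HTTP GET, no JavaScript rendering."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_to_disk(url: str, cache_dir: Path, fetch=download, refresh: bool = False) -> Path:
    """Stages 1 and 2: fetch the page only if there is no cached copy yet."""
    path = cache_path(url, cache_dir)
    if refresh or not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
    return path


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_saved_html(path: Path) -> str:
    """Stage 3: read the saved HTML and emit JSON, with no network access."""
    parser = TitleParser()
    parser.feed(path.read_text(encoding="utf-8"))
    return json.dumps({"title": parser.title.strip()})
```

Since `parse_saved_html` only touches the file on disk, you can rewrite the parser and rerun it as often as you like without sending another request. Passing `refresh=True` is a manual stand-in for "the source has changed"; detecting that automatically is not shown here.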

By dividing scraping into rendering, caching, and parsing, you save yourself a lot of web requests. This also helps you avoid triggering the website's IP blocking, DDoS protection, and rate limiting.