Hacker News
by 0cf8612b2e1e 637 days ago
A personal rule of mine is to always separate data receipt+storage from parsing. The retrieval is comparatively very expensive and has few possible failure modes. Parsing can always fail in new and exciting ways.

Disk space to store the returned data is cheap, and the stored copies can be purged periodically, but only once you are certain the content has been properly extracted.
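A minimal sketch of that separation in Python, using only the standard library. The function names (`fetch_and_store`, `store_raw`, `parse_title`, `parse_stored`, `purge_parsed`), the hashed-filename layout, and the choice of extracting the page title are all assumptions made for illustration, not anything from a real scraper. The point it tries to show is that raw bytes are written to disk untouched, parsing runs later against the stored copies, and a copy is deleted only after it has parsed successfully.

```python
import hashlib
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


def store_raw(raw_dir, url, body):
    """Write the response bytes to disk unmodified; no parsing happens here."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, keyed by a hash of the URL.
    path = raw_dir / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    path.write_bytes(body)
    return path


def fetch_and_store(url, raw_dir):
    """The slow step: one network request, then straight to disk."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return store_raw(raw_dir, url, resp.read())


class _TitleParser(HTMLParser):
    """Collects the text inside <title>, as a stand-in for real extraction."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(body):
    """The fast, repeatable step. Raises ValueError when extraction fails."""
    parser = _TitleParser()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    title = parser.title.strip()
    if not title:
        raise ValueError("no <title> found")
    return title


def parse_stored(raw_dir):
    """Parse every stored copy; failures are recorded, not fatal."""
    results, failed = {}, []
    for path in sorted(raw_dir.glob("*.html")):
        try:
            results[path.name] = parse_title(path.read_bytes())
        except ValueError:
            failed.append(path)  # keep the raw copy so parsing can be retried
    return results, failed


def purge_parsed(raw_dir, results):
    """Delete only the raw copies that were parsed successfully."""
    for name in results:
        (raw_dir / name).unlink()
```

With this layout, a parser bug means fixing `parse_title` and re-running `parse_stored` against the files already on disk, without sending another request to the scraped site.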

2 comments

Did you mean "retrieval is comparatively inexpensive"? I think I'm on the same page but this threw me off.
I read it as retrieval being the requests to the scraped site. I can parse a few thousand HTML pages in minutes, but fetching them in the first place takes hours.
Exactly what I intended. Scraping is slow (and the result may be an irreplaceable snapshot in time). Parsing is fast and repeatable, so it should be done in a separate process, working from a stored copy.
I ended up with the same design after encountering numerous exotic failure modes.