HN Mirror



	by 0cf8612b2e1e 637 days ago
	A personal rule of mine is to always separate data receipt+storage from parsing. The retrieval is comparatively very expensive and has few possible failure modes. Parsing can always fail in new and exciting ways. Disk space to store the returned data is cheap and can be periodically flushed only when you are certain the content has been properly extracted.

2 comments

cjonas 636 days ago

Did you mean "retrieval is comparatively inexpensive"? I think I'm on the same page but this threw me off.


franga2000 636 days ago

I read it as retrieval being the requests to the scraped site. I can parse a few thousand HTML pages in minutes, but fetching them in the first place takes hours.


0cf8612b2e1e 636 days ago

Exactly what I intended. Scraping is slow (and the result may be an irreplaceable snapshot in time). Parsing is fast and repeatable, so it should be done in a separate process, working from a stored copy.
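
For anyone who wants to see the shape of this, here is a rough sketch of the split described above, not anyone's actual code from this thread. It assumes the fetch step hands over raw bytes from whatever HTTP client you use. The names `store_raw`, `parse_stored`, `flush_parsed`, and the toy `<title>` parser are all made up for illustration.

```python
import hashlib
from html.parser import HTMLParser
from pathlib import Path


def store_raw(raw_dir: Path, url: str, body: bytes) -> Path:
    """Save the fetched bytes untouched, keyed by a hash of the URL."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = raw_dir / name
    path.write_bytes(body)
    return path


class TitleParser(HTMLParser):
    """Toy parser (hypothetical stand-in): collects the text inside <title>."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(body: bytes) -> str:
    """Extract the page title, raising if the page has none."""
    parser = TitleParser()
    parser.feed(body.decode("utf-8"))
    parser.close()
    title = parser.title.strip()
    if not title:
        raise ValueError("no <title> found")
    return title


def parse_stored(raw_dir: Path, parse=parse_title):
    """Parse every stored page; return (results, failures) keyed by file name."""
    results, failures = {}, {}
    for path in sorted(raw_dir.glob("*.html")):
        try:
            results[path.name] = parse(path.read_bytes())
        except Exception as exc:  # a parse failure must not lose the raw copy
            failures[path.name] = repr(exc)
    return results, failures


def flush_parsed(raw_dir: Path, results: dict) -> None:
    """Delete raw copies only for pages that parsed successfully."""
    for name in results:
        (raw_dir / name).unlink()
```

The scraper only ever calls `store_raw`. `parse_stored` can be rerun as often as needed after fixing the parser, and `flush_parsed` removes a raw page only once it has produced a result, so failed pages stay on disk for the next attempt.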


erichocean 637 days ago

I ended up with the same design after encountering numerous exotic failure modes.
