by dmn001 3154 days ago
There is no issue with parsing and scraping in the same loop as long as there is caching in there as well. You don't want to be hitting the server repeatedly whilst you're debugging.
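
To make that concrete, here is a minimal sketch of a fetch helper that only touches the server on a cache miss. The `page_cache` directory, the function names, and the injectable `downloader` argument are all assumptions made up for illustration, not anything from a particular library.

```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path("page_cache")  # hypothetical cache location


def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.raw"


def download(url: str) -> bytes:
    """Fetch the raw response body over the network."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch(url: str, cache_dir: Path = CACHE_DIR, downloader=download) -> bytes:
    """Return the body for url, calling downloader only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_bytes()
    body = downloader(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)  # keep the raw bytes exactly as received
    return body
```

With something like this in place, the scrape-and-parse loop can be re-run as often as needed while debugging, and only the first run generates traffic.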

A project like Scrapy should have caching on by default, but it seems to be an afterthought. Repeatable and reproducible parsing of cached websites is necessary, e.g. if you find additional data fields that you want to parse without downloading the entire site over again.
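
For reference, Scrapy does ship an HTTP cache middleware, but it is disabled unless you turn it on in the project settings. A sketch of the relevant excerpt, assuming a standard project layout with a `settings.py`:

```python
# settings.py (excerpt) for a Scrapy project

# The HTTP cache is off by default; this switches it on.
HTTPCACHE_ENABLED = True

# Directory for cached responses, relative to the project's data directory.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# DummyPolicy caches every response and ignores HTTP cache headers,
# which is usually what you want for repeatable offline parsing.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```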

1 comment

I think the bigger point is the benefit of storing pulled data as is for the future, not so much about hitting the server multiple times. If so, I agree with this 100% -- being able to re-run your algorithms later on a local dataset is a powerful capability. Later time, different computer, new software version -- no problem, you have a local copy of the data.
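
As a rough illustration of re-running a parser over a local copy, the sketch below pulls a newly wanted field (the page title, chosen arbitrarily) out of every stored page without any network access. It assumes the raw pages were saved as `*.html` files in one directory; the directory layout and the function names are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str):
    """Return the stripped title text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


def reparse(archive_dir: Path) -> dict:
    """Run the current extractor over every stored page, entirely offline."""
    results = {}
    for path in sorted(archive_dir.glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = extract_title(html)
    return results
```

Swapping in a different extractor later only means changing `extract_title`; the stored pages stay untouched.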

With caching, you are at the mercy of whatever third-party caching scheme is used under the hood, and the raw pulled data can disappear at any time without your explicit command (e.g., if some library gets updated and decides that this invalidates the existing cache).

By caching, I just mean storing data locally so you don't have to request it again within a certain timeframe. I use my own caching scripts written in Python. If you use a third-party library, data deletion doesn't matter much either, provided you configure it properly and back up the data. HTML/JSON data compresses really well using LZMA2 in 7-Zip.
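
A minimal sketch of that idea, not the actual scripts mentioned above: store each response compressed, and treat it as fresh only within a chosen time window. It uses Python's standard `lzma` module, whose default `.xz` output uses the LZMA2 filter, in place of 7-Zip. The function names and the `now` parameter are assumptions for illustration.

```python
import lzma
import time
from pathlib import Path


def store(path: Path, body: bytes) -> None:
    """Write the raw response body to disk, xz-compressed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(lzma.compress(body))


def load_if_fresh(path: Path, max_age_seconds: float, now=None):
    """Return the stored body, or None if it is missing or too old.

    Stale files are only ignored, never deleted, so the raw data
    stays on disk until you remove it yourself.
    """
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > max_age_seconds:
        return None
    return lzma.decompress(path.read_bytes())
```

A caller would try `load_if_fresh` first and only request the page again, then `store` it, when that returns `None`.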