
by di456 1267 days ago

> My project naturally has 2 month "iterations" because that's how often I do a new crawl and load crawl data

I have a similar project type and timeline, and I'm going through one of these iterations now. Could you share how you persist your data between iterations?

For mine, these were some of the bigger iterations that were spaced out over several months:

1. I started out by writing raw HTTP API responses to Python pickle files, scraping via the API.

2. Then I wrote some analysis logic in Python that read the pickle files and wrote a CSV that I could analyze in spreadsheets.

3. Scaled up to a couple of orders of magnitude more data, across multiple datasets. Added logic to bulk-export any dataset to flattened CSV, then bulk-import the CSVs into SQLite tables.

For each of those I had to rethink the code, simplify it, and make some parts generic where there was a common need.
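A rough sketch of what step 3 could look like, not the actual code from the project. It assumes each pickle file holds a list of (possibly nested) dicts; the names `flatten`, `pickles_to_csv`, and `csv_to_sqlite`, and the `*.pkl` file pattern, are made up for illustration. It uses only the standard library.

```python
import csv
import pickle
import sqlite3
from pathlib import Path


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def pickles_to_csv(pickle_dir, csv_path):
    """Write every *.pkl file (assumed: a list of dicts each) into one flat CSV."""
    rows = []
    for path in sorted(Path(pickle_dir).glob("*.pkl")):
        with open(path, "rb") as f:
            rows.extend(flatten(r) for r in pickle.load(f))
    # Union of all keys, so records with missing fields still fit the header.
    columns = sorted({col for row in rows for col in row})
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)  # missing fields are written as empty strings
    return columns


def csv_to_sqlite(csv_path, db_path, table):
    """Bulk-load a CSV into a SQLite table, replacing any existing table."""
    conn = sqlite3.connect(db_path)
    try:
        with open(csv_path, newline="") as f, conn:
            reader = csv.reader(f)
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            marks = ", ".join("?" for _ in header)
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(f'CREATE TABLE "{table}" ({cols})')
            conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    finally:
        conn.close()
```

Everything lands in SQLite as text because the columns are untyped. That keeps the loader generic across datasets, at the cost of casting in queries. Also note that unpickling is only safe for files you wrote yourself.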

I think my next big push will be loading new data to SQLite incrementally, and figuring out how and where to persist the SQLite data cheaply. Right now it's local but too big to check into a GitHub repo.
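One possible shape for the incremental load, as a sketch under assumptions rather than a recommendation: it assumes every record has a stable unique key (here a hypothetical `id` column) and that replacing the whole row on a key collision is acceptable. `upsert_rows` is an illustrative name, and the approach relies on SQLite's `INSERT OR REPLACE`.

```python
import sqlite3


def upsert_rows(db_path, table, key, rows):
    """Insert new rows and overwrite existing ones that share the same key."""
    if not rows:
        return
    columns = list(rows[0])
    # The key column becomes the primary key so collisions can be detected.
    defs = ", ".join(
        f'"{c}" PRIMARY KEY' if c == key else f'"{c}"' for c in columns
    )
    col_sql = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # one transaction per batch
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({defs})')
            conn.executemany(
                f'INSERT OR REPLACE INTO "{table}" ({col_sql}) VALUES ({marks})',
                ([row.get(c) for c in columns] for row in rows),
            )
    finally:
        conn.close()
```

With this, each new crawl only needs to pass its new or changed records, for example `upsert_rows("crawl.db", "items", "id", new_records)`, instead of rebuilding the table from CSV.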