Hacker News new | ask | show | jobs
by hinkley 2333 days ago
Do you generate the file from scratch every time or do you modify the previous one as new data arrives?
2 comments

Depends on what you want... if you have a separate db project, you can have the output of that project be a clean database for testing other things, or a set of migration scripts for existing deployments.

I've been working on doing similar with containerized dababase servers for testing, while still having versioned scripts for prod (multiple separate deployments).

It is a bit of a hybrid.

In the early stages of development of whatever the ETL process is, I keep the database and just empty it out each time. As I got more of a sense of what I needed, I started DROPing my TABLEs more often and remaking them. Eventually I would make the whole database from scratch once I was along the way and had most everything fleshed out.

Ok. So each export is a full dump, not a delta on a previous one.

Do you anticipate hitting a wall at some point where the total time becomes a problem?

Well, it depends on the process. Some were full dumps, some were deltas pushed up to the final database, sometimes both (this product in particular had a load from file capability that you were supposed to use but some edge cases that were not well-addressed).

No, the time never grew significantly.

For one of the analysis projects, just one step of the analysis was quite time consuming but it would have been that way no matter what. SQLite allowed me to let it grind away overnight (or even over a weekend) on a workstation without tormenting production servers.