Hacker News new | ask | show | jobs
by at_a_remove 2333 days ago
It's pretty straightforward, really.

I construct the .sqlite database from scratch each time in Python, building out table after table as I like it.

Some configuration data is loaded in from files first. This could be some default values or even test records for later injection.

The input data is loaded into the appropriate tables and then indexed as appropriate (or if appropriate). It is as "raw" as I can get it.

Each successive transformation occurs on a new table. This is so I can always go back one step for any post-mortem if I need to. Also, I can reference something that might be DELETEd in an a later table.

Often (and this is task-dependent), I will have to pull in data from other server-based databases, typically the target. They get their own tables. Then I can mark certain records as not being present in the target database, so they must be INSERTed. If a record is not present in my input and is there in the target, that would suggest a DELETE. Finally, I can compare records where some ID is present in my input and my .sqlite, they might be good for an UPDATE. All of this is so I can make only the changes that need to be made. Speed is not important to me here, only understanding what changes needed to be made and having a record of what they were and why.

I am happy to say that an ETL process I wrote using this general method back around 2009 is probably still running. I haven't had to touch it in years. Occasionally I will receive questions as to "why did this happen?" and I can just start running queries on the resultant .sqlite database file, kept with the logs, for answers.

Similarly, I can use these sorts of techniques when I am analyzing other datasets. The value here is that I can just refresh one table when the relevant data comes in, rather than having to run the ingest process for everything all over again. This can save me a lot of time.

1 comments

Awesome - elegantly simple using very common technologies.
I am not a very talented programmer so I stick very close to what is common, standard, and easy to understand. It usually means I am on the downslope of the hype cycle and it limits some opportunities but I have become okay with that.

I have gotten some CS students who were about to shoot flies with various cannons turned on to SQLite. I kept a couple of the decent books about it nearby and would shove it into their hands at that point. Usually a week later they would be raving about it.

Do you still have the titles of those books at hand? I'd love to take a look at them.
They are The Definitive Guide to SQLite by Mike Owens and Using SQLite by Jay A. Kreibich. I am quite sure they are more book than I needed, I only plumbed a fraction of SQLite's immense capabilities.