Hacker News
by thingsilearned 2429 days ago
We're definitely not trying to start from scratch or throw out all the old knowledge/practices - just update them for the common data stacks used today.

In the book we use many of the old terms and recommendations. Most of the high-level organizing is still totally right - but a lot of the optimizing and work done for performance and cost reasons is very different now.

For example, ELT now makes much more sense than ETL, for the reasons Kostas wrote about here: https://dataschool.com/data-governance/etl-vs-elt/

And many things previously done for cost and performance reasons are just not relevant anymore thanks to the big innovations in C-Store warehouses.

2 comments

ELT has been around for 15+ years. Inmon refers to it as a Persistent Staging Area (PSA) in an Enterprise Data Warehouse.

The difference now is that you have Hadoop and cloud providers that will take a credit card and give you as much space as you can pay for. The concept is not new; cost was simply a limiting factor back then because capacity was fixed and memory was expensive.

The only thing that has changed is that the commoditization of hardware has allowed for behaviors that would previously have been cost-prohibitive.

Yup! And that commoditization of hardware has made it really inexpensive to have a Data Lake, where you first put all your data in raw format (so you only need to do the EL, and not the T, in that step). And then, because of the way C-Store warehouses like Redshift are built, it makes a ton of sense to just do your T step as a set of Views (materialized or not) on top of that Data Lake.

It allows you to not do E & T & L all together. It's really nice (less complex, easier to implement, less costly, and more flexible) to have that T part pulled out and done after.
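
To make the "EL first, T as views" idea concrete, here is a minimal sketch. It uses an in-memory SQLite database purely as a stand-in for a warehouse like Redshift; the table name `raw_orders`, the view name `customer_totals`, and the sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system (illustrative data).
raw_orders = [
    ("1", "alice", "2019-01-03", "19.99"),
    ("2", "bob", "2019-01-03", "5.00"),
    ("3", "alice", "2019-01-04", "12.50"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse / data lake

# E + L: land the data untouched. Every column stays text and nothing is dropped.
conn.execute(
    "CREATE TABLE raw_orders (id TEXT, customer TEXT, ordered_at TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# T: a view on top of the raw table does the casting and aggregation afterwards.
conn.execute(
    """
    CREATE VIEW customer_totals AS
    SELECT customer, ROUND(SUM(CAST(amount AS REAL)), 2) AS total_spent
    FROM raw_orders
    GROUP BY customer
    """
)


def customer_totals(connection):
    """Return (customer, total_spent) rows from the transform view."""
    return connection.execute(
        "SELECT customer, total_spent FROM customer_totals ORDER BY customer"
    ).fetchall()


print(customer_totals(conn))
```

The load step knows nothing about how the data will be used; all of the business logic lives in the view definition, which is plain SQL.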

C-stores 1) improve aggregation performance, since values for the same column are contiguous on disk/in memory, and 2) benefit from bitmap indices when not doing range queries. Why do you think C-stores make a difference for the T stage? Because the reduced performance overhead makes views viable?
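
A toy illustration of those two properties, in plain Python rather than any real column-store engine. The table, the column names, and the helper functions are hypothetical, and real systems add compression and vectorized execution on top of this.

```python
from array import array

# Hypothetical three-column table (id, country, amount); illustrative data only.
rows = [(1, "US", 10.0), (2, "DE", 20.0), (3, "US", 30.0)]

# Row layout: each record's fields sit together, so aggregating one column
# still means walking every whole record.
row_store = rows

# Column layout: each column is stored on its own; numeric columns are
# contiguous typed arrays.
column_store = {
    "id": array("q", [r[0] for r in rows]),
    "country": [r[1] for r in rows],
    "amount": array("d", [r[2] for r in rows]),
}


def sum_amount_row_store(table):
    """Aggregate by touching every record and picking out one field."""
    return sum(record[2] for record in table)


def sum_amount_column_store(table):
    """Aggregate by scanning only the one contiguous column that is needed."""
    return sum(table["amount"])


def build_bitmap_index(values):
    """Map each distinct value to an int whose set bits mark matching row positions."""
    index = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


country_bitmaps = build_bitmap_index(column_store["country"])


def sum_amount_where_country(table, bitmaps, country):
    """Equality filter via the bitmap, then aggregate the amount column."""
    bits = bitmaps.get(country, 0)
    return sum(v for i, v in enumerate(table["amount"]) if (bits >> i) & 1)
```

The bitmap helps with equality predicates such as `country = 'US'`; a range predicate over many distinct values would have to combine many bitmaps, which is why the benefit is usually described for non-range queries.
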
Yes, and pulling the T out also makes the whole process much simpler. The reason T had to be done at the same time as E&L was because of those storage and performance costs. Now you don't have to - and that separates the stages and simplifies things.

The T being after the L means you can do that stage more simply in just SQL (with views or materialized views - possibly with the help of dbt), as opposed to some vendor interface or a Python/R/etc. script.

It also means that rebuilding your warehouse is much less of an ordeal. When the structure of the source data changes, or if you want to migrate the warehouse schemas, you don't need to also re-run your ETL jobs and start over from scratch.
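
A rough sketch of what that rebuild can look like when the T is just a view: the raw table is loaded once, and a change in requirements only means redefining the view. SQLite again stands in for the warehouse; `raw_events`, `events_per_user`, and `rebuild_view` are hypothetical names, not anything from the book or from dbt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Loaded once by the EL step and never touched again below (illustrative data).
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("u1", "signup", "2019-01-01"),
        ("u1", "purchase", "2019-01-02"),
        ("u2", "signup", "2019-01-02"),
    ],
)


def rebuild_view(connection, select_sql):
    """Replace the transform by dropping and recreating the view; raw data is untouched."""
    connection.execute("DROP VIEW IF EXISTS events_per_user")
    connection.execute(f"CREATE VIEW events_per_user AS {select_sql}")


def read_view(connection):
    """Return the current contents of the transform view, ordered by user."""
    return connection.execute(
        "SELECT user_id, n FROM events_per_user ORDER BY user_id"
    ).fetchall()


# First version of the transform: count every event per user.
rebuild_view(conn, "SELECT user_id, COUNT(*) AS n FROM raw_events GROUP BY user_id")
v1 = read_view(conn)

# Requirements change: only purchases should count. No extract or load is re-run.
rebuild_view(
    conn,
    "SELECT user_id, COUNT(*) AS n FROM raw_events "
    "WHERE event = 'purchase' GROUP BY user_id",
)
v2 = read_view(conn)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
```

With a classic ETL pipeline the same change would usually mean editing the transform code and re-running the jobs against the sources; here the migration is a single DDL statement.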

No one does just "ETL" anymore - ELT/ETL are more or less synonymous now. I can't remember the last time I talked to a team that was throwing data away because of storage constraints. It's a straw man in 2019.

C-store doesn't solve for denormalization or provide any advantages when you're copying data all over the place to avoid joins.

1. It is - and that's why I chose that example (a less controversial one) here.

And though ELT is a very common standard now, I don't know of a single book that recommends it or explains why that change has happened. That's just one of the reasons for writing a new data book.

2. C-Store does largely solve the previous performance and cost issues with denormalization. We also write a bit in this book (more to come) on how to avoid doing all that copying.