Hacker News
by supercanuck 2429 days ago
ELT has been around for 15+ years. Inmon refers to it as a Persistent Staging Area (PSA) in an Enterprise Data Warehouse.

The difference now is that you have Hadoop and cloud providers that will take credit cards and give you as much space as you can pay for. The concept is not new; cost was simply a factor back then because capacity was fixed and memory was expensive.

The only thing that has changed is that the commoditization of hardware has allowed for different behaviors that would previously have been cost prohibitive.


Yup! And that commoditization of hardware has made it really inexpensive to have a Data Lake, where you first put all your data in raw format (so you only need to do EL, and not T, in that step). And then, because of the way C-Stores (column stores) like Redshift are built, it makes a ton of sense to just do your T step as a set of views (materialized or not) on top of that Data Lake.
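To make that concrete, here is a minimal sketch of "EL first, T as a view". It uses Python's built-in sqlite3 purely as a stand-in (SQLite is a row store, not a column store like Redshift), and the table name, view name, and sample rows are all made up for illustration.

```python
import sqlite3

# Hypothetical raw rows; in practice this is whatever the E step pulled.
raw_orders = [
    ("1", "alice", "19.99", "2019-03-01"),
    ("2", "bob", "5.00", "2019-03-01"),
    ("3", "alice", "7.50", "2019-03-02"),
]

conn = sqlite3.connect(":memory:")

# EL: land the data as-is, everything as text, no cleanup.
conn.execute(
    "CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT, ordered_on TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# T: a view on top of the raw table does the typing and the aggregation.
conn.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT ordered_on AS day,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_orders
    GROUP BY ordered_on
    """
)

rows = conn.execute("SELECT day, revenue FROM daily_revenue ORDER BY day").fetchall()
print(rows)
```

The load step never has to know what the reporting layer wants; all of that lives in the view definition.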

It allows you to not do E & T & L all together. It's really nice (less complex, easier to implement, less costly, and more flexible) to have that T part pulled out and done after.

C-stores 1) improve aggregation performance, since values for the same column are contiguous on disk or in memory, and 2) benefit from bitmap indices when not doing range queries. Why do you think C-stores make a difference for the T stage? Because the reduced performance overhead makes views viable?

Yes, and also the whole process is much simpler once you pull the T out. The reason T had to be done at the same time as E&L was those storage and performance costs. Now you don't have to do that, which separates the stages and simplifies things.
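
On the point about column values being contiguous, a toy illustration in Python may help. This is only a sketch of the layout idea, not how Redshift or any real column store is implemented, and the sample data is invented.

```python
from array import array

# Row layout: each record is stored together, so summing one field
# means walking every record and picking the field out of it.
rows = [(1, "alice", 19.99), (2, "bob", 5.00), (3, "alice", 7.50)]
row_total = sum(record[2] for record in rows)

# Column layout: each column is its own contiguous buffer, so an
# aggregate over "amount" reads only that buffer and nothing else.
columns = {
    "id": array("q", [1, 2, 3]),
    "amount": array("d", [19.99, 5.00, 7.50]),
}
col_total = sum(columns["amount"])

print(row_total, col_total)
```

Both totals are the same; the difference is how much unrelated data has to be scanned to get there, which is what makes aggregate-heavy views cheap enough to use.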

Having the T after the L means you can do that stage more simply in plain SQL (with views or materialized views, possibly with the help of dbt), as opposed to some vendor interface or a Python/R/etc. script.

It also means that rebuilding your warehouse is much less of an ordeal. When the structure of the source data changes, or if you want to migrate the warehouse schemas, you don't need to re-run your ETL jobs and start over from scratch.
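
A rough sketch of what that rebuild looks like when the T layer is just a view, again using sqlite3 as a stand-in. The `build_view` helper, the table and view names, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw landing table, loaded once (hypothetical data).
conn.execute("CREATE TABLE raw_events (user_name TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("alice", "10"), ("bob", "4"), ("alice", "6")],
)


def build_view(select_sql):
    """Rebuild the T layer by replacing the view; raw_events is not touched."""
    conn.execute("DROP VIEW IF EXISTS spend")
    conn.execute(f"CREATE VIEW spend AS {select_sql}")


# Version 1 of the warehouse schema: total per user.
build_view(
    "SELECT user_name, SUM(CAST(amount AS INTEGER)) AS total "
    "FROM raw_events GROUP BY user_name"
)
v1 = conn.execute("SELECT * FROM spend ORDER BY user_name").fetchall()

# Version 2: renamed columns plus an order count, with no re-extract or reload.
build_view(
    "SELECT user_name AS customer, SUM(CAST(amount AS INTEGER)) AS total_spend, "
    "COUNT(*) AS orders FROM raw_events GROUP BY user_name"
)
v2 = conn.execute("SELECT * FROM spend ORDER BY customer").fetchall()

print(v1)
print(v2)
```

The schema change here is one dropped and recreated view; the raw table is loaded exactly once and is reused by both versions.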