Hacker News new | ask | show | jobs
by ryantuck 2429 days ago
I've recently dug into the Agile DWH Design and The DWH Toolkit books and the design tips in them all seem really compelling, for good reason. Though as I've actually started modeling, I've found that the creation of "proper" fact/dimension tables has felt at times like overkill, given the technologies we're using (BQ / Postgres / Looker).

So, perfect timing! Really looking forward to checking this out.

2 comments

Yeah exactly. Much of the "overkill" was done because of performance and cost reasons, that frankly just don't apply anymore. Now the largest expense by far is time.

There are a number of people starting to talk about Star Schemas having little gain on modern stacks. The performance and costs gains are automatically done now with C-Store warehouses. Fivetran has a great post on this https://fivetran.com/blog/obt-star-schema

The reason a Star Schema exists is because it reduces the amount of data stored and it reduces the load times (faster updates) because only the facts and measures get updated. There is also little redundancy in Master Data (dimensions).

in the world of fixed assets (on-premise data warehouses) this was a concern, I'm curious to see how this plays out in these Cloud Providers because to me it seems like they will be more than happy to rent you as much space as you want and everyone is happy until the bill comes due.

Doing materialized, flat tables everywhere is great for reporting performance but the tables will not be updated as quickly, there will be redundancy in storage and it will be difficult to sync time dependent dimensions.

This is what I mean by updates due to C-Store warehouse engines (Redshift, BigQuery, Snowflake, etc). It's not just that the cloud providers are happy to rent you more space - it's also that they're C-Store and because of that redundancy in columns is well compressed automatically.

We try to explain some of that here - https://dataschool.com/data-modeling-101/row-vs-column-orien...

And Fivetran did a great benchmarking of it here - https://fivetran.com/blog/obt-star-schema

The architecture of the C-Store warehouses often removes the benefits of materialized views. This is why for a very long time Redshift didn't even support them - they insisted they weren't needed as they didn't improve performance over regular redshift significantly.

Rediscovering the wheel much ? You guys are starting again like 30 years ago, with simple queries and think all that was built is overkill. When you mature and get into complex analytics and joining data from 20 sources you will understand all that was built. Just you wait...
We're definitely not trying to start from scratch or throw out all the old knowledge/practices - just update them for the common data stacks used today.

In the book we use much of the old terms and recommendations. Most of the high level organizing is still totally right - but a lot of the optimizing and work done for performance and cost reasons is very different now.

For example ELT makes now much more sense than ETL for the reasons Kostas wrote about here: https://dataschool.com/data-governance/etl-vs-elt/

And many things previously done for cost and performance reasons are just not relevant anymore thanks to the big innovations in C-Store warehouses.

ELT has been around for 15+ years. Inmon refers to it as a Persistent Staging Area (PSA) in an Enterprise Data Warehouse.

The difference now is, you have Hadoop and cloud providers that will take credit cards and give you as much space as you can pay for. The concept is not new, it was just cost was a factor back then because capacity was fixed and memory was expensive.

the only thing that has changed is the commoditization of hardware has allowed for different behaviors that would have been cost prohibitive.

Yup! And that commoditization of hardware has made it really inexpensive to have a Data Lake, where you first put all your data in raw format (so you only need to do EL - and not T all in the same step). And then, because of the way C-Store sources like Redshift are built it makes a ton of sense to just do your T step as a set of Views (materialized or not) onto of that Data Lake.

It allows you to not do E & T & L all together. It's really nice (less complex, easier to implement, less costly, and more flexible) to have that T part pulled out and done after.

C-store's 1) improve aggregation performance since values are continuous on disk/memory for the same column and 2) benefit from bitmap indices when not doing range queries. Why do you think c-stores make a difference for the T stage? Because the performance overhead makes views viable?
Yes, and also the whole process is much simpler to pull the T out. The reason T had to be done at the same time as E&L was because of those storage and performance costs. Now you don't have to - and it separates the stages and simplifies.

The T being after the L means you can do that stage more simply in just SQL (with views or materialized views - possibly with the help of DBT), as opposed to some vendor interface, or python/R/etc script.

It also means that rebuilding your warehouse is much less significant of an ordeal. When the structure of source data changes or if you want to make some migration of the schemas of the warehouse you don't need to also re-run your ETL jobs and start over from scratch.

No one does just "ETL" anymore - ELT/ETL are more or less synonymous now. I can't remember the last time I talked to a team that was throwing data away anymore because of storage constraints. It's a straw man in 2019.

C-store doesn't solve for denormalization or provide any advantages when you're copying data all over the place to avoid joins.

1. It is - and that's why I chose that example (a less controversial one) here.

And though ELT is a very common standard now - I don't know of a single book that recommends it or explains why that change has happened. Just one of the reasons for writing a new data book.

2. C-Store does largely solve pervious performance and cost issues with denormalization. We also write a bit in this book (more to come) on how to avoid doing all that copying.