by cdcarter 1715 days ago
In my experience, a data warehouse usually has an ETL process at the beginning. Data comes in from disparate sources and on a regular basis, it is ETLd into a shape that is ready to use by the business. On the other hand, a data lake slurps in all the data as soon as it is available, in whatever form it is in. You have to process it into the business-consumable form when you query/egress it, but you don't have to know your dream schema up front.
1 comments

My experience is similar: extract process -> raw data -> clean/merge -> model

Normally you extract from source, then load to destination. There is no business logic in this process.

From raw you do all of your transforms to clean up and merge the data, and then get it into a usable model. With big data sets I've done this with Hadoop and then moved the clean/merged data to a standard or MPP DB for analysts. For normal-sized sets this can all be done in a standard DB.
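A rough sketch of the "all in a standard DB" case, using Python's built-in sqlite3 as a stand-in. The table names (raw_orders, clean_orders, customer_revenue), the columns, and the sample rows are all made up for illustration; a real pipeline would use whatever dataops tool and warehouse you have.

```python
import sqlite3


def run_elt(conn, source_rows):
    """Load raw rows, then clean/merge and model them inside the database."""
    # Load: land the extracted rows untouched. No business logic here,
    # so every column stays text exactly as the source delivered it.
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

    # Clean/merge: cast types, normalise names, drop duplicates and
    # rows with no amount. This reads from raw and leaves raw in place.
    conn.execute(
        """
        CREATE TABLE clean_orders AS
        SELECT DISTINCT
            CAST(order_id AS INTEGER) AS order_id,
            LOWER(TRIM(customer))     AS customer,
            CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE TRIM(amount) <> ''
        """
    )

    # Model: the business-facing shape, built from the clean layer.
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM clean_orders
        GROUP BY customer
        """
    )
    return conn.execute(
        "SELECT customer, orders, revenue FROM customer_revenue ORDER BY customer"
    ).fetchall()


# Hypothetical extract: a duplicate row, inconsistent casing, a missing amount.
source_rows = [
    ("1", " Alice ", "10.50"),
    ("1", " Alice ", "10.50"),
    ("2", "alice", "4.50"),
    ("3", "Bob", ""),
    ("4", "BOB", "20"),
]

conn = sqlite3.connect(":memory:")
model = run_elt(conn, source_rows)
print(model)
```

All three layers stay queryable afterwards: raw_orders still holds every row as extracted, so an analyst can go back to it if the clean/merge rules turn out to be wrong.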

The other part is that all the data, from raw through clean/merge, is kept and available for analysts to use, on the thinking that storage costs are extremely low and heading to zero. In a traditional DW, by contrast, analysts use only the modeled sets and, depending on the data, earlier sets are deleted because they are treated as operational only. Storage there is considered expensive and limiting.

The move to ELT with a declarative dataops tool has been mind-bending, and it has been a multiplier on the speed of getting to something usable. I don't want to see another DW again.