

by mason55 2620 days ago

Yes, in my mind this is the key to a data lake. Take all your raw data and store it somewhere, then provide ways for people to access and query the raw data.

This means ingestion is faster (no transformation) and you don't throw away any data that you might want later. If multiple teams want to query the same data in different ways, they can. And ideally it prevents data silos: everyone can stuff their raw data into a master data lake, and each team has access to all the data but is responsible for doing the work to make it look the way they want.

The reality obviously doesn't always match the theory, but schema-on-read and ELT are the easiest ways to get there. Typically this involves some kind of Hadoop-style technology: Hive or SparkSQL for SQL-based querying, Spark for non-SQL, etc. But you've always got the raw data, and you can go back and re-ELT it from the data lake if your needs change.
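
To make schema-on-read concrete, here is a minimal sketch in plain Python rather than Hive or Spark. Everything in it is made up for illustration: the event fields, the `read_with_schema` helper, and the two "team" schemas. The raw lines are stored untouched, and each reader decides which fields it wants and how to cast them at query time.

```python
import json

# Raw events exactly as they arrived: nothing parsed, renamed, or dropped at ingestion.
# (Hypothetical sample data; in a real lake these would be files in HDFS, S3, etc.)
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "ts": "2017-03-01T10:00:00", "country": "US"}',
    '{"user": "b", "amount": "3.00", "ts": "2017-03-01T11:30:00", "country": "DE"}',
    '{"user": "a", "amount": "7.25", "ts": "2017-03-02T09:15:00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast them.

    `schema` maps field name -> cast function. Fields missing from a raw
    record come back as None instead of failing the whole read.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# One team wants revenue per user, so it reads `amount` as a float.
finance_schema = {"user": str, "amount": float}
revenue = {}
for row in read_with_schema(RAW_EVENTS, finance_schema):
    revenue[row["user"]] = revenue.get(row["user"], 0.0) + row["amount"]

# Another team only cares about event counts per country, from the same raw data.
marketing_schema = {"country": str}
by_country = {}
for row in read_with_schema(RAW_EVENTS, marketing_schema):
    key = row["country"] or "unknown"
    by_country[key] = by_country.get(key, 0) + 1

print(revenue)
print(by_country)
```

If a third team later needs `ts`, nothing has to be re-ingested: they write their own schema and read the same raw lines again, which is the "re-ELT" part.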