by Joeri
2620 days ago
I wouldn't say you don't have schemas; rather, you have schema-on-read instead of schema-on-write, and you use an extract-load-transform (ELT) pattern instead of extract-transform-load (ETL). The data is replicated as-is into the data lake, and only then do you figure out what to do with it.
This means ingestion is faster (no transformation step) and you don't throw away any data that you might want later. If multiple teams want to query the same data in different ways, they can. And ideally it prevents data silos: everyone can put their raw data into a master data lake, and each team has access to all the data but is responsible for the work of shaping it the way they want.
The reality obviously doesn't always match the theory, but schema-on-read and ELT are the easiest ways to get there. Typically this involves some kind of Hadoop-style technology, like Hive or Spark SQL for SQL-based querying, Spark for non-SQL, etc. But you've always got the raw data and can go back and re-ELT it from the data lake if your needs change.
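
To make schema-on-read concrete, here is a rough sketch in plain Python rather than Hive or Spark. Everything in it is made up for illustration: the `RAW_LAKE` records, the two team schemas, and the helper names. The point is only that the raw lines are stored untouched and each reader applies its own schema when it queries them.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (the "EL" of ELT).
# Note that the third record has an extra field the others lack.
RAW_LAKE = [
    '{"user": "ann", "amount": "19.99", "country": "BE"}',
    '{"user": "bob", "amount": "5.00", "country": "NL"}',
    '{"user": "ann", "amount": "7.50", "country": "BE", "coupon": "SPRING"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields missing from a record come back as None instead of failing.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Two (hypothetical) teams read the same raw data with different schemas.
FINANCE_SCHEMA = {"user": str, "amount": float}
MARKETING_SCHEMA = {"country": str, "coupon": str}


def revenue_per_user(raw_lines):
    """Finance view: total amount per user."""
    totals = {}
    for row in read_with_schema(raw_lines, FINANCE_SCHEMA):
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


def coupon_use_per_country(raw_lines):
    """Marketing view: number of orders with a coupon, per country."""
    counts = {}
    for row in read_with_schema(raw_lines, MARKETING_SCHEMA):
        if row["coupon"] is not None:
            counts[row["country"]] = counts.get(row["country"], 0) + 1
    return counts


if __name__ == "__main__":
    print(revenue_per_user(RAW_LAKE))
    print(coupon_use_per_country(RAW_LAKE))
```

If a third team later needs a field nobody cared about at ingestion time, they just write another schema and re-read the same raw lines. With schema-on-write that field would likely have been dropped during the transform step.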