Hacker News
by Joeri 2620 days ago
I wouldn't say you don't have schemas; rather, you have schema-on-read instead of schema-on-write, and you use an extract-load-transform pattern instead of extract-transform-load. The data is replicated as-is into the data lake and only then do you figure out what to do with it.
2 comments

Yes, in my mind this is the key idea of a data lake. Take all your raw data and store it somewhere, then provide ways for people to access and query the raw data.

This means ingestion is faster (no transformation) and you don't throw away any data that you might want later. If multiple teams want to query the same data in different ways, they can. And ideally it prevents data silos, because everyone can put their raw data into a master data lake, and each team has access to all the data but is responsible for doing the work to make it look the way they want.
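
To make the "same raw data, different views" point concrete, here's a rough sketch in plain Python. The event records, field names, and the two view functions are all made up for illustration. In practice the raw lines would sit in files in the lake and the views would be Hive or Spark queries, but the shape is the same: nothing is enforced at write time, and each reader imposes its own schema.

```python
import json

# Hypothetical raw events, stored exactly as they were produced (no transformation).
RAW_LINES = [
    '{"user": "a", "event": "click", "ms": 120, "page": "/home"}',
    '{"user": "b", "event": "view", "page": "/pricing"}',
    '{"user": "a", "event": "click", "ms": 80, "page": "/pricing", "ab_group": "B"}',
]


def read_raw(lines):
    """Parse each stored line; no schema is enforced at this point."""
    return [json.loads(line) for line in lines]


def analytics_view(records):
    """One team's schema-on-read: hit counts per page."""
    counts = {}
    for rec in records:
        counts[rec["page"]] = counts.get(rec["page"], 0) + 1
    return counts


def performance_view(records):
    """Another team's schema-on-read: latencies per user.

    Records without an "ms" field are skipped rather than rejected.
    """
    latencies = {}
    for rec in records:
        if "ms" in rec:
            latencies.setdefault(rec["user"], []).append(rec["ms"])
    return latencies


records = read_raw(RAW_LINES)
page_hits = analytics_view(records)
latencies = performance_view(records)
```

Note that the third record carries an extra `ab_group` field that neither view uses. It costs nothing to keep, and a third team can start reading it whenever they need it.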

The reality obviously doesn't always match the theory, but schema-on-read and ELT are the easiest ways to get there. Typically this involves some kind of Hadoop-style technology: Hive or SparkSQL for SQL-based querying, Spark for non-SQL, etc. But you've always got the raw data and can go back and re-ELT it from the data lake if your needs change.
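
Here's a small sketch of what "go back and re-ELT" looks like, again in plain Python with invented order records and hypothetical function names (`transform_v1`, `transform_v2`). The first transform drops a field nobody asked for; when the requirement changes, the second one is simply run against the same raw lines. With a transform-before-load pipeline, that field would already be gone.

```python
import json

# Hypothetical raw order records, loaded into the lake untouched.
RAW = [
    '{"order_id": 1, "amount": "19.99", "currency": "EUR", "coupon": "SPRING"}',
    '{"order_id": 2, "amount": "5.00", "currency": "USD"}',
]


def transform_v1(raw_lines):
    """First pass: only order id and amount were needed."""
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        rows.append({"order_id": rec["order_id"], "amount": float(rec["amount"])})
    return rows


def transform_v2(raw_lines):
    """Needs changed: coupon usage is now wanted too, so re-run from raw."""
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        rows.append({
            "order_id": rec["order_id"],
            "amount": float(rec["amount"]),
            "coupon": rec.get("coupon"),  # None when the order had no coupon
        })
    return rows


orders_v1 = transform_v1(RAW)  # derived table as first built
orders_v2 = transform_v2(RAW)  # rebuilt later from the same raw data
```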

I don't know much about the strict definition, but that's how I use them. I have had several clients who wanted to analyze data they hadn't captured in their schema. I'd say: disk is cheap. Throw everything in there (medical records, events, etc.). If we need it later, we'll fish it out. Ugly, but simple.