by layer8 906 days ago
I never understood what is meant by “data lake” in the first place, other than “heterogeneous collection of large-ish data files”.

During the big data hype, I did a feasibility study for this for my previous company (typical bigcorp) and learned that most companies should not have a data lake.

A data lake is a collection of different types of structured/unstructured data like CSV, Parquet, text, images, etc. stored in an object store or some such that in principle you're able to query. The theory is that you can just dump stuff into a kitchen drawer (ELT instead of ETL) and be able to do analytics on it later.
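
To make the "dump now, query later" idea concrete, here is a minimal sketch using only the Python standard library. A temporary directory stands in for the object store, and the file names, fields, and the `dump_raw` / `shipped_revenue` helpers are all made up for illustration. The point is only that nothing is typed or joined until someone reads the files.

```python
import csv
import json
import tempfile
from pathlib import Path


def dump_raw(lake: Path) -> None:
    """'Load' step of ELT: write files as-is, with no shared schema."""
    (lake / "orders.csv").write_text("order_id,amount\n1,19.99\n2,5.00\n")
    (lake / "events.jsonl").write_text(
        '{"order_id": 1, "event": "shipped"}\n'
        '{"order_id": 2, "event": "cancelled"}\n'
    )


def shipped_revenue(lake: Path) -> float:
    """'Transform' step, done at query time: types and joins are decided here."""
    with (lake / "orders.csv").open(newline="") as f:
        # CSV values arrive as strings; the reader has to choose the types.
        amounts = {int(row["order_id"]): float(row["amount"])
                   for row in csv.DictReader(f)}
    shipped = set()
    with (lake / "events.jsonl").open() as f:
        for line in f:
            event = json.loads(line)
            if event["event"] == "shipped":
                shipped.add(event["order_id"])
    return sum(amounts[order_id] for order_id in shipped)


# A temporary directory plays the role of the object store (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    dump_raw(lake)
    revenue = shipped_revenue(lake)
```

In a relational database the equivalent types and keys would have been declared once, at write time, which is the contrast the rest of this comment is about.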

But most enterprises already have huge investments in relational databases (SQL Server, Oracle, etc.), which are structured, schema-typed storage engines with decades of optimization behind them. If you have a SQL database, chances are your data is already in the right format for analytics, and building a data lake is the wrong way to go.

People in tech companies have the wrong impression that enterprises have a lot of big data, but the fact is that in most companies the valuable data amounts to less than a few terabytes in total. It's mostly ERP data, Excel files, and operational data from various sensors (if that).

To unstructure the (already structured) data just so it can fit into the data lake seemed like the wrong strategy, but I was surprised by how heavily companies like Cloudera hyped it up so they could sell technologies like Hive, Spark, Presto, etc. (and streaming tech like Kafka). These are overkill for most enterprises.

Yes, a typical DWH spends a lot of cycles trying to create a single consistent interpretation of the raw data. A data lake is just this raw data, plus whatever ad hoc interpretations of it your data analysts create.

A lakehouse is basically an attempt to get most of the DWH benefit by just making these ad hoc interpretations incremental.

That's pretty much it.
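
One possible reading of "incremental interpretation", sketched in plain Python: the raw batches stay append-only, and a derived table folds in only the batches it has not seen yet instead of being rebuilt from scratch. The `IncrementalTotals` class, the batch layout, and the customer names are hypothetical, and this is not how any particular lakehouse product works internally.

```python
from collections import defaultdict


class IncrementalTotals:
    """Hypothetical derived table: total amount per customer."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)
        # Watermark: how many raw batches are already folded in.
        self.batches_seen = 0

    def refresh(self, raw_batches: list) -> int:
        """Fold in only the batches that arrived since the last refresh."""
        new_batches = raw_batches[self.batches_seen:]
        for batch in new_batches:
            for customer, amount in batch:
                self.totals[customer] += amount
        self.batches_seen = len(raw_batches)
        return len(new_batches)


# The raw data is an append-only list of batches (an assumption).
raw = [[("acme", 10.0), ("globex", 5.0)]]
view = IncrementalTotals()
processed_first = view.refresh(raw)   # folds in the one existing batch

raw.append([("acme", 2.5)])
processed_second = view.refresh(raw)  # folds in only the newly added batch
```

The raw batches are never modified, so an analyst can still define a different interpretation over the same data later.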