
During the big data hype, I did a data lake feasibility study for my previous company (a typical bigcorp) and learned that most companies should not have one.

A data lake is a collection of different types of structured/unstructured data like CSV, Parquet, text, images, etc. stored in an object store or some such that in principle you're able to query. The theory is that you can just dump stuff into a kitchen drawer (ELT instead of ETL) and be able to do analytics on it later.
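To make the "dump now, query later" idea concrete, here is a minimal sketch, assuming a local temp directory stands in for the object store. The file names, the `amount` field, and the helpers `dump_raw` and `total_amount` are all made up for illustration. The point is just that nothing checks the data on the way in, so the parsing and type casting have to happen at query time.

```python
import csv
import json
import tempfile
from pathlib import Path


def dump_raw(lake: Path, name: str, text: str) -> None:
    """'Load' step: store the payload untouched, with no schema check."""
    (lake / name).write_text(text)


def total_amount(lake: Path) -> float:
    """'Transform' step, deferred to query time: parse and cast on read."""
    total = 0.0
    for path in sorted(lake.iterdir()):
        if path.suffix == ".csv":
            with path.open(newline="") as f:
                rows = list(csv.DictReader(f))
        elif path.suffix == ".json":
            rows = json.loads(path.read_text())
        else:
            continue  # free text, images, etc. are stored but not queryable here
        for row in rows:
            total += float(row["amount"])  # the type is only enforced here
    return total


# Hypothetical "lake": a temp directory standing in for an object store.
lake = Path(tempfile.mkdtemp())
dump_raw(lake, "orders_2019.csv", "order_id,amount\n1,10.5\n2,4.5\n")
dump_raw(lake, "orders_2020.json", '[{"order_id": 3, "amount": "5"}]')
dump_raw(lake, "notes.txt", "call the supplier")

print(total_amount(lake))  # 20.0
```

Every reader of the lake ends up carrying its own version of that parsing logic, which is the cost the kitchen drawer hides.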

But most enterprises already have huge investments in relational databases (SQL Server, Oracle, etc.), which are structured, schema-typed engines for storing data that have been optimized over decades. If you have a SQL database, chances are you already have data in the right format for analytics, and building a data lake is the wrong way to go.
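As a rough illustration of "already in the right format", here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for SQL Server or Oracle. The `orders` table and its columns are hypothetical. The schema rejects bad rows at write time, and the analytics question is a single query with no extra parsing layer.

```python
import sqlite3

# In-memory SQLite database as a stand-in for an enterprise RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "region TEXT NOT NULL, "
    "amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "EU", 10.5), (2, "EU", 4.5), (3, "US", 5.0)],
)

# The schema is enforced on write: a row with a missing amount is rejected.
try:
    conn.execute(
        "INSERT INTO orders (order_id, region, amount) VALUES (?, ?, ?)",
        (4, "US", None),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Analytics runs directly against the typed table.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(rejected)           # True
print(revenue_by_region)  # [('EU', 15.0), ('US', 5.0)]
```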

People in tech companies have this wrong impression that enterprises have a lot of big data, but the fact is, the valuable data in most companies adds up to less than a few terabytes total. It's mostly ERP data, Excel files, and operational data from various sensors (if that).

To unstructure the (already structured) data just so it can fit into the data lake seemed like the wrong strategy, but I was surprised by how much companies like Cloudera hyped it up so they could sell technologies like Hive, Spark, Presto, etc. (and streaming tech like Kafka). These are overkill for most enterprises.