by debo_ 906 days ago
I've heard of data lakes, but "data lakehouse" sounds like where upper class data goes in the summer to take their data-boats data-fishing.

The name is easy to poke fun at, but I think it’s a real problem. A lot of companies use data lakes to store data and warehouses to serve BI to tools like Tableau or PowerBI. They then end up copying data between the two.

Querying a lake directly and having transactions, governance, etc. against one set of data (a data lakehouse) can really simplify the stack and take out cost.

Ah, so the house part comes from warehouse. Not obvious to say the least.
I never understood what is meant by “data lake” in the first place, other than “heterogeneous collection of large-ish data files”.
During the big data hype, I did a feasibility study for this for my previous company (typical bigcorp) and learned that most companies should not have a data lake.

A data lake is a collection of different types of structured/unstructured data like CSV, Parquet, text, images, etc. stored in an object store or some such that in principle you're able to query. The theory is that you can just dump stuff into a kitchen drawer (ELT instead of ETL) and be able to do analytics on it later.
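To make the "dump now, query later" idea concrete, here is a minimal sketch of schema-on-read. Everything in it is made up for illustration: the `lake` dict stands in for an object store, and the paths, fields, and helper names (`read_rows`, `total_by_region`) are hypothetical.

```python
import csv
import io
import json

# Hypothetical "lake": raw blobs keyed by object path, stored exactly as they arrived.
lake = {
    "sales/2023-01.csv": "region,amount\nnorth,100\nsouth,250\n",
    "sales/2023-02.json": '[{"region": "north", "amount": 75}]',
}


def read_rows(path, blob):
    """Apply a schema at read time (schema-on-read), chosen by file extension."""
    if path.endswith(".csv"):
        rows = csv.DictReader(io.StringIO(blob))
    elif path.endswith(".json"):
        rows = json.loads(blob)
    else:
        return  # unknown formats stay in the lake, simply unqueried
    for row in rows:
        # The typing happens here, at query time, not when the file was loaded.
        yield {"region": row["region"], "amount": float(row["amount"])}


def total_by_region(lake):
    """Aggregate across every readable file, regardless of its raw format."""
    totals = {}
    for path, blob in lake.items():
        for row in read_rows(path, blob):
            totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


print(total_by_region(lake))
```

The point of the sketch is that nothing was transformed on the way in. Every query pays the cost of interpreting the raw files, which is the trade-off ELT makes against ETL.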

But most enterprises already have huge investments in relational databases (SQL Server, Oracle, etc.), which are structured, schema-typed storage engines optimized over decades. If you have a SQL database, chances are your data is already in the right format for analytics, and building a data lake is the wrong way to go.

People in tech companies have this wrong impression that enterprises have a lot of big data, but the fact is that in most companies the valuable data adds up to less than a few terabytes. It's mostly ERP data, Excel files, and operational data from various sensors (if that).

Unstructuring the (already structured) data just so it can fit into a data lake seemed like the wrong strategy, but I was surprised how much companies like Cloudera hyped it up so they could sell technologies like Hive, Spark, Presto, etc. (and streaming tech like Kafka). These are overkill for most enterprises.

Yes, a typical DWH spends a lot of cycles trying to create a single consistent interpretation of the raw data. A data lake is just this raw data, plus whatever ad hoc interpretations of it your data analysts create.

A lakehouse is basically an attempt to get most of the DWH benefit by just making these ad hoc interpretations incremental.
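As a rough illustration of what "incremental" could mean here, the sketch below keeps a derived table up to date by folding in only the raw files it has not seen yet, instead of recomputing from scratch. The `IncrementalTotals` class and the file names are hypothetical, and real lakehouse table formats do far more than this (transactions, schema tracking, and so on).

```python
class IncrementalTotals:
    """Keeps a derived aggregate current by processing only unseen raw files."""

    def __init__(self):
        self.seen = set()   # paths of raw files already folded into the totals
        self.totals = {}    # the derived "interpretation": amount per region

    def refresh(self, raw_files):
        """raw_files maps a path to a list of (region, amount) records.

        Returns the number of files processed during this refresh.
        """
        processed = 0
        for path in sorted(raw_files):
            if path in self.seen:
                continue  # already reflected in the totals, skip the rescan
            for region, amount in raw_files[path]:
                self.totals[region] = self.totals.get(region, 0) + amount
            self.seen.add(path)
            processed += 1
        return processed


raw = {"day1": [("north", 100), ("south", 250)]}
view = IncrementalTotals()
first = view.refresh(raw)       # processes day1

raw["day2"] = [("north", 75)]   # new raw data lands in the lake
second = view.refresh(raw)      # processes only day2

print(first, second, view.totals)
```

The second refresh touches only the new file, which is the cost saving compared with rerunning the whole ad hoc query every time.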

That's pretty much it.
Naming is hard, I hope the industry can come up with something better eventually.

It is definitely jarring in my head every time I hear it or read it.

I prefer the name, as I find there's a straight-line correlation between stupid names like "Data Lakehouse" and bad engineering practices. Another sign is that the dumber the buzzword, the more consultants and middlemen exist to skim money off a problem that should never exist in the first place.