
by dekhn 1676 days ago

History matters here, and I don't know how well this is documented, but: data warehouses have been around since the 70s or so, while "data lake" is a newer term. Data warehouses came from an era when nearly all data was stored in the database itself (typically Oracle), owned and controlled by one or a few groups, and there were only a few databases, which were the source of truth. The two databases would normally be a transaction engine handling real-time load (just what's required to authorize a credit card transaction, for example) and a "warehouse" containing all the long-term data, like every transaction that had ever occurred.

Data lakes are more modern and came about as people realized they had 30 databases and the business wanted to run queries against all of them simultaneously (e.g., join your credit card transaction history with historical rates of default in a zip code), quickly. The data warehouse solution was to use federated database queries (JOINs across databases), or to force everybody to consolidate. A data lake is a single virtual entity that represents "all your data in one place".
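
To make the "JOINs across databases" idea concrete, here's a toy sketch of the credit card example using Python's built-in sqlite3 and ATTACH DATABASE. It's only an illustration: SQLite stands in for real federated engines, and the table names, column names, and sample rows (transactions, default_rates, the zip codes) are all made up for the example.

```python
import sqlite3


def build_demo():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")  # "main": the transaction database
    # A second, independent database (hypothetical "risk" database).
    conn.execute("ATTACH DATABASE ':memory:' AS risk")

    conn.execute(
        "CREATE TABLE main.transactions (card_id TEXT, zip TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE risk.default_rates (zip TEXT PRIMARY KEY, default_rate REAL)"
    )

    # Invented sample data, purely for illustration.
    conn.executemany(
        "INSERT INTO main.transactions VALUES (?, ?, ?)",
        [("c1", "94107", 20.0), ("c2", "94107", 5.0), ("c3", "10001", 12.5)],
    )
    conn.executemany(
        "INSERT INTO risk.default_rates VALUES (?, ?)",
        [("94107", 0.02), ("10001", 0.05)],
    )
    return conn


def spend_with_default_rate(conn):
    """Join transaction totals in one database with default rates in another."""
    query = """
        SELECT t.zip, SUM(t.amount) AS total_spend, r.default_rate
        FROM main.transactions AS t
        JOIN risk.default_rates AS r ON r.zip = t.zip
        GROUP BY t.zip, r.default_rate
        ORDER BY t.zip
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in spend_with_default_rate(build_demo()):
        print(row)
```

With two databases this is manageable; with 30 of them on different engines, every query has to know where each table lives, which is roughly the pain the "one place for all your data" pitch is aimed at.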

It's based on a weak analogy where a warehouse is a place where you put stuff in very well organized locations while a lake is a place where a bunch of different waters slosh together.

Storing unstructured data in a database is dumb because databases cost about 10X the storage space due to indexing, while unstructured data can often just sit around passively in a filesystem (and/or have a filesystem index built into it for fast queries).

I view this through the lens of web tech: for example, see the wars between the MapReduce and database people, and how Google evolved from MapReduce against GFS to Flume against Spanner, showing we just live in an endless cycle of renaming old technology.

It's absolutely correct that the terminology doesn't map perfectly.

1 comment

pxc 1676 days ago

This was really helpful, too. Thanks!
