by uvdn7 1676 days ago
It’s probably just me, but the distinction between data lake and data warehouse seems like splitting hairs. Unstructured data can always be stored in structured databases. What’s the main reason for both to coexist?
2 comments

History matters here, and I don't know how well this is documented, but: data warehouses have been around since the 70s or so, while data lake is a newer term. Data warehouses came from an era where nearly all data was stored in the database itself (typically Oracle), owned and controlled by a single group or a few groups, and there were only a few databases, which were the source of truth. The two databases would normally be a transaction engine handling real-time load (just what's required to authorize a credit card transaction, for example) and a "warehouse" that contained all the long-term data, like every transaction that had ever occurred.

Data lakes are more modern and came about as people realized they had 30 databases and the business wanted to run queries against all of them simultaneously (e.g., join your credit card transaction history with historical rates of default in a zip code), and quickly. The data warehouse solution was to use federated database queries (JOINs across databases) or to force everybody to consolidate. A data lake is a single virtual entity that represents "all your data in one place".
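To make the "JOINs across databases" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module, where `ATTACH DATABASE` stands in for a federated setup. The table names, columns, and sample rows are made up for illustration; a real warehouse of that era would have used something like Oracle database links instead.

```python
import sqlite3


def cross_database_join():
    # The main connection stands in for the "transactions" database.
    conn = sqlite3.connect(":memory:")
    # A second, separate in-memory database stands in for the
    # "default rates" database owned by another group.
    conn.execute("ATTACH DATABASE ':memory:' AS rates")

    # Hypothetical schemas, assumed only for this example.
    conn.execute(
        "CREATE TABLE main.transactions (card_id TEXT, zip TEXT, amount REAL)"
    )
    conn.execute(
        "CREATE TABLE rates.default_rates (zip TEXT, default_rate REAL)"
    )
    conn.executemany(
        "INSERT INTO main.transactions VALUES (?, ?, ?)",
        [("c1", "94105", 25.0), ("c2", "10001", 40.0)],
    )
    conn.executemany(
        "INSERT INTO rates.default_rates VALUES (?, ?)",
        [("94105", 0.02), ("10001", 0.05)],
    )

    # One query that joins tables living in two different databases.
    rows = conn.execute(
        """
        SELECT t.card_id, t.zip, t.amount, r.default_rate
        FROM main.transactions AS t
        JOIN rates.default_rates AS r ON r.zip = t.zip
        ORDER BY t.card_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in cross_database_join():
        print(row)
```

This works for two databases; with 30 of them, each with its own owner and schema, every such query turns into a coordination problem, which is roughly the pain the data lake idea was responding to.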

It's based on a weak analogy where a warehouse is a place where you put stuff in very well organized locations while a lake is a place where a bunch of different waters slosh together.

Storing unstructured data in a database is dumb because databases cost about 10X the storage space due to indexing, while unstructured data often can just sit around passively in a filesystem (and/or have a filesystem index built into it for fast queries).

I view this through the lens of web tech. For example, see the wars between the MapReduce and database people, and how Google evolved from MapReduce against GFS to Flumes against Spanner, which shows we just live in an endless cycle of renaming old technology.

It's absolutely correct that the terminology doesn't map perfectly.

This was really helpful, too. Thanks!
It used to be that way. Old data warehouses (built on relational databases) couldn't handle large-scale data, and old data lakes used to be hard to use (you had to write a map-reduce job to query data).

It is barely true nowadays.
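For a sense of what "write a map-reduce job to query data" meant, here is a toy, single-process sketch in plain Python. It is not Hadoop or any real framework; the comma-separated record layout and the function names are assumptions made for the example. It answers what a one-line SQL `GROUP BY` would: total amount per zip code.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    # Parse one raw text record (assumed layout: card_id,zip,amount)
    # and emit a (key, value) pair.
    card_id, zip_code, amount = line.split(",")
    return (zip_code, float(amount))


def run_job(lines):
    # Shuffle: group the mapped values by key.
    grouped = defaultdict(list)
    for key, value in map(map_phase, lines):
        grouped[key].append(value)
    # Reduce: fold each key's values into a single total.
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


if __name__ == "__main__":
    raw = ["c1,94105,25.0", "c2,10001,40.0", "c3,94105,5.0"]
    print(run_job(raw))
```

A real job also had to deal with input splits, serialization, and cluster scheduling, which is a large part of why the old lakes felt hard to use.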

i worked at excite.com right after the IPO, and front and center in the HQ building was a MASSIVE glass wall showcasing the oracle data warehouse machine room.

i didn't enjoy working w/ either the datastore directly or the DBA team that ran it. an early, more old-white-dude "i just want to serve 5T"