by dekhn 1681 days ago
sure, but if I see the term 'data lake' I'm gonna Bing it, with the first result being https://aws.amazon.com/big-data/datalakes-and-analytics/what... which explains it nicely.

ELI5 is for Reddit; here we generally expect you to google it to get the ELI5 explanation before giving us your hot take in a comment.


Yeah, that's exactly the kind of content I found unsuitable when I did a web search for the term. It spends a whole two sentences giving an explanation that tells me very little about how data lakes are anything more specific than a cloud-hosted database solution, and moves on to

> Organizations that successfully generate business value from their data, will outperform their peers.

at which point I'm like

> ok, I'm reading a covert advertisement about Fancy Cloud Technology aimed at some kind of big-spending manager, which is unlikely to tell me meaningfully what this actually is

and I'm out. I was looking for content that was in a more neutral, purely educational genre, and wondering what collection of non-cloud analogues it replaces/is composed of. Someone writing in the comments

> I used it to transform several terabytes of JSON into nice relational data for analysts without too much effort

is way, way more direct and helpful than mentioning that 'unlike data warehouses, data lakes support non-relational data'. Like great, it's a cloud thing that supports a variety of databases. But what is it?

> before giving us your hot take in a comment

I didn't give any take at all? I just really found all the sources that came up on the first page of search results to be almost in the wrong genre for me, and expected (correctly) that people on this site would be able to produce descriptions in 1-5 sentences that worked way better for me.

Pretty much all of the answers I got here were really good, and I'm glad I asked.

> What is a data lake?

> A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

This may be self-explanatory for you, but what it means in practice is not as self-evident as you believe. For all it describes, it could be an FTP upload directory that loads things into a SQLite database. It's not until the scale is invoked (multiple terabytes per day) that the inadequacies of a naive solution become apparent. For those in that area of the industry, Snowflake is already known. (Seriously, if you're running into the limitations of Redshift, it behooves you to take a look at Snowflake.) For those who aren't, data warehousing is unfamiliar, never mind data lakes. For those outside the ML sphere, the finer points of training runs are also non-obvious.
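To make the "upload directory that loads things into a SQLite database" picture concrete, here is a minimal sketch of that naive version. Everything in it is an assumption for illustration: the `events` table, the `user` and `amount` fields, and the idea that each file holds a JSON array of flat objects. It is a toy that works at small scale, not a description of how any real data lake product is built.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def load_directory(landing_dir: Path, conn: sqlite3.Connection) -> int:
    """Load every *.json file in landing_dir into an `events` table.

    Returns the number of rows inserted. The table layout is hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (source_file TEXT, user TEXT, amount REAL)"
    )
    loaded = 0
    for path in sorted(landing_dir.glob("*.json")):
        # Assumption: each file holds a JSON array of flat objects.
        for record in json.loads(path.read_text()):
            conn.execute(
                "INSERT INTO events VALUES (?, ?, ?)",
                (path.name, record.get("user"), record.get("amount")),
            )
            loaded += 1
    conn.commit()
    return loaded


def demo() -> list:
    """Drop two raw files into a temp 'upload directory', load them, then query."""
    with tempfile.TemporaryDirectory() as tmp:
        landing = Path(tmp)
        (landing / "a.json").write_text(json.dumps([{"user": "ann", "amount": 5.0}]))
        (landing / "b.json").write_text(
            json.dumps([{"user": "bob", "amount": 7.5}, {"user": "ann", "amount": 2.5}])
        )
        conn = sqlite3.connect(":memory:")
        load_directory(landing, conn)
        rows = conn.execute(
            "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
        ).fetchall()
        conn.close()
        return rows


if __name__ == "__main__":
    print(demo())
```

The raw files stay as they were uploaded and the structure is only imposed at load time. At multiple terabytes per day, a single-process loop like this is exactly where the naive version falls over.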

It’s probably just me, but the distinction between data lake and data warehouse seems like splitting hairs. Unstructured data can always be stored in structured databases. What’s the main reason for both to coexist?

History matters here and I don't know how well this is documented, but: data warehouses have been around since the 70s or so, data lake is a newer term. Data warehouses came from an era where nearly all data was stored in the database itself (typically Oracle), owned and controlled by a single group or a few groups, and there were only a few databases, which were the source of truth. The two databases would normally be a transaction engine handling real-time load (just what's required to authorize a credit card transaction, for example), and a "warehouse" which contained all the long-term data, like every transaction that had ever occurred.

Data lakes are more modern and came about as people realized they had 30 databases and the business wanted to run queries against all of them simultaneously (e.g., join your credit card transaction history with historical rates of default in a zip code), quickly. The data warehouse solution was to use federated database queries (JOINs across databases), or force everybody to consolidate. A data lake is a single virtual entity that represents "all your data in one place".
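For anyone who hasn't seen a cross-database JOIN, here is a rough sketch of the credit card example using SQLite's `ATTACH DATABASE` as a stand-in for federation. The two database files, the table names, and the sample rows are all made up for illustration; a real setup would be federating separate Oracle (or other) servers, not two local files.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_databases(workdir: Path):
    """Create two separate database files, as if owned by two different teams."""
    tx_path = workdir / "transactions.db"
    risk_path = workdir / "risk.db"

    tx = sqlite3.connect(str(tx_path))
    tx.execute("CREATE TABLE transactions (card TEXT, zip TEXT, amount REAL)")
    tx.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [("card-1", "94103", 20.0), ("card-2", "10001", 35.0), ("card-3", "99999", 5.0)],
    )
    tx.commit()
    tx.close()

    risk = sqlite3.connect(str(risk_path))
    risk.execute("CREATE TABLE default_rates (zip TEXT, default_rate REAL)")
    risk.executemany(
        "INSERT INTO default_rates VALUES (?, ?)",
        [("94103", 0.02), ("10001", 0.05)],
    )
    risk.commit()
    risk.close()
    return tx_path, risk_path


def federated_join(tx_path: Path, risk_path: Path) -> list:
    """Join transactions with default rates that live in a different database file."""
    conn = sqlite3.connect(str(tx_path))
    # Attach the second database so one query can see both.
    conn.execute("ATTACH DATABASE ? AS risk", (str(risk_path),))
    rows = conn.execute(
        "SELECT t.card, t.amount, r.default_rate "
        "FROM transactions AS t "
        "JOIN risk.default_rates AS r ON r.zip = t.zip "
        "ORDER BY t.card"
    ).fetchall()
    conn.close()
    return rows


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        tx_path, risk_path = build_databases(Path(tmp))
        return federated_join(tx_path, risk_path)


if __name__ == "__main__":
    print(demo())
```

With two databases this is manageable. With 30 of them, each with its own owner and schema, every new question means wiring up another attach-and-join, which is the pain the "all your data in one place" pitch is aimed at.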

It's based on a weak analogy where a warehouse is a place where you put stuff in very well organized locations while a lake is a place where a bunch of different waters slosh together.

Storing unstructured data in a database is dumb because databases cost about 10X storage space due to indexing, while unstructured data often can just sit around passively in a filesystem (and/or have a filesystem index built into it for fast queries).

I view this through the lens of web tech: for example, see the wars between the MapReduce and database people, and how Google evolved from MapReduce against GFS to Flume against Spanner, showing we just live in an endless cycle of renaming old technology.

It's absolutely correct that the terminology doesn't map perfectly.

This was really helpful, too. Thanks!

It used to be that way. Old data warehouses (built on relational dbs) couldn't handle large scale data, and old data lakes used to be hard to use (write a map-reduce job to query data).

It is barely true nowadays.

i worked at excite.com right after the IPO, and front and center in the HQ building was a MASSIVE glass wall showcasing the oracle data warehouse machine room.

i didn't enjoy working w/either the datastore directly or the DBA team that ran it. an early, more old-white-dude "i just want to serve 5T"