


	by andrewflnr 1923 days ago
	Serious question: what's the difference? I've seen both of these terms a lot but never with a concrete definition. I get the impression neither one refers to a terribly precise concept.

5 comments

dijksterhuis 1923 days ago

Depends who you ask. Traditionally speaking:

# Data lake

Data is stored en masse with no schema applied; either unstructured or structured data can be dumped straight into the lake, or transformed first and then dumped in. It turns into a data swamp when it becomes unusable due to staleness or complexity.

Data lakes are basically an AWS S3 bucket that business users can access and (attempt to) do reporting on.
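
To make the "no schema applied" point concrete, here is a minimal sketch of schema-on-read. Everything in it is an assumption for illustration: a temporary directory stands in for the S3 bucket, and the record shapes and the `read_sales` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical "lake": a local directory standing in for an S3 bucket.
lake = Path(tempfile.mkdtemp())

# Write path: records are dumped as-is, with no schema enforced.
raw_events = [
    {"type": "sale", "amount_gbp": 12.5, "sku": "A1"},
    {"type": "sale", "amount": "7.00", "currency": "GBP"},  # different shape
    {"type": "page_view", "url": "/pricing"},               # unrelated record
]
(lake / "events.jsonl").write_text("\n".join(json.dumps(e) for e in raw_events))


def read_sales(path):
    """Apply a schema at read time: keep sales and normalise the amount field."""
    for line in path.read_text().splitlines():
        record = json.loads(line)
        if record.get("type") != "sale":
            continue
        amount = record.get("amount_gbp", record.get("amount"))
        yield {"amount_gbp": float(amount)}


sales = list(read_sales(lake / "events.jsonl"))
total_gbp = sum(s["amount_gbp"] for s in sales)
```

Every reader has to rediscover the shape of the data for itself, which is roughly how a lake drifts towards a swamp as the number of record shapes grows.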

# Data warehouse

A heavily structured schema is applied to the data used in reporting, and it usually defines the single point of truth for business purposes. It uses a star schema model (if you follow the Kimball [0] methodology), in which dimension tables are used to filter and aggregate the raw measurements in the central fact tables (which contain your actual measures, like £ made on one sale).
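
As a rough illustration of the star schema idea, here is a small sketch using an in-memory SQLite database. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are all made up for the example; a real Kimball-style warehouse would carry far more dimensions and attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to filter and group.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);

-- Central fact table: one row per sale, holding the actual measure.
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount_gbp REAL
);

INSERT INTO dim_product VALUES
    (1, 'Tea', 'Drinks'), (2, 'Coffee', 'Drinks'), (3, 'Scone', 'Food');
INSERT INTO dim_date VALUES
    (1, '2020-01-30', '2020-01'), (2, '2020-02-01', '2020-02');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 2.50), (2, 2, 1, 3.00), (3, 3, 2, 1.75), (4, 1, 2, 2.50);
""")

# Filter on one dimension (date) and aggregate by another (product category).
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount_gbp)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    WHERE d.month = '2020-02'
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The query has the same shape as a pivot table: dimensions on the rows and filters, a measure from the fact table in the cells.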

The Kimball and Inmon [1] philosophies come with their own benefits and trade-offs. See the bottom of [2].

Edit: got the methodologies the wrong way round with initial costs; the linked article has a useful table that I didn't see.

Data warehouses have a very concrete definition and are usually implemented via Kimball's or Inmon's method. When I've worked with them they've become the bastion of business reporting (Excel users love a pivot table).

---

Just to confuse matters, there's also the data vault: https://en.m.wikipedia.org/wiki/Data_vault_modeling

0: https://en.m.wikipedia.org/wiki/Ralph_Kimball

1: https://en.m.wikipedia.org/wiki/Bill_Inmon

2: https://www.zentut.com/data-warehouse/kimball-and-inmon-data...

laichzeit0 1923 days ago

Disagree with the swamp part. See it as a staging area for the warehouse and data science. If you use your data lake as the source for your warehouse then you’re forced to keep it clean. Also, for data science use cases you need access to both the raw data and what’s in the warehouse, which is much easier if the data lake can be used, as opposed to essentially building one anyway.

dijksterhuis 1923 days ago

I have 100% seen data swamps in the wild:

> We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.

https://en.wikipedia.org/wiki/Data_lake

> If you use your data lake as the source for your warehouse then you’re forced to keep it clean

So you apply a schema...? Determining what should and should not live in your storage? What format it should be in etc.?

That's not a data lake in my mind -- it's just the ETL staging area.

This is why I dislike the term data lake. It's so vague and loose that it could mean 300 different things to 300 different experts. Makes it a great term for snake oil marketing though.

contravariant 1923 days ago

Most people I've met use 'datalake' to refer to a (categorised) collection of otherwise unprocessed data.

A data-warehouse is typically somewhat more structured and doesn't just collect data but also combines and links data from multiple sources, typically with the goal of creating a set of tables that you can use for reporting without needing to know all the intricate details of how the source-data is linked.

A data-warehouse can be based on a datalake. You could also make a data-warehouse without first building a datalake, but keeping the datalake part separate allows for better separation of concerns. You can also have a datalake without building a data-warehouse on top of it; it depends on what you want to use it for.
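
A tiny sketch of that "combine and link" step, assuming two hypothetical raw extracts (a CRM export and a shop's order log) that share only an email address as a key. The field names and the `build_orders_report` helper are invented for the example.

```python
# Hypothetical raw extracts as they might sit in a datalake, one per source system.
crm_customers = [
    {"crm_id": "C-1", "email": "ann@example.com", "region": "North"},
    {"crm_id": "C-2", "email": "bob@example.com", "region": "South"},
]
shop_orders = [
    {"order_id": 10, "customer_email": "ANN@example.com", "total": 20.0},
    {"order_id": 11, "customer_email": "bob@example.com", "total": 5.0},
    {"order_id": 12, "customer_email": "ann@example.com", "total": 7.5},
]


def build_orders_report(customers, orders):
    """Link orders to CRM regions via a normalised email and total per region."""
    region_by_email = {c["email"].lower(): c["region"] for c in customers}
    report = {}
    for order in orders:
        region = region_by_email.get(order["customer_email"].lower(), "Unknown")
        report[region] = report.get(region, 0.0) + order["total"]
    return report


# The reporting table hides how the sources are linked (here, case-insensitive email).
revenue_by_region = build_orders_report(crm_customers, shop_orders)
```

Someone reporting on `revenue_by_region` never needs to know that the two systems disagree on email casing; that detail lives in the warehouse build, not in the report.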

milkytron 1923 days ago

Did a quick search because I was curious; this was the first result:

> Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

andrewflnr 1923 days ago

In my defense, when I did my googling a while back, it was specifically about data lakes, not the comparison. :)

waynesonfire 1923 days ago

For starters, you won't hit the front page with one.

1996 1923 days ago

Some people want to deploy cool technology to hit the front page.

Other people want to deploy robust, tested and tried solutions, that won't break in mysterious ways, just to make money.

I side with the latter.

texasbigdata 1923 days ago

Sure, I'll try: schema, master data management, thoughtfulness about upstream and downstream usage, potentially optimization...
