by laichzeit0 1923 days ago
Disagree with the swamp part. See it as a staging area for the warehouse and data science. If you use your data lake as the source for your warehouse then you’re forced to keep it clean. Also, for data science use cases you need access to both the raw data and what’s in the warehouse, which is much easier if the data lake can be used, as opposed to essentially building one anyway.
1 comment

I have 100% seen data swamps in the wild:

> We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.

https://en.wikipedia.org/wiki/Data_lake

> If you use your data lake as the source for your warehouse then you’re forced to keep it clean

So you apply a schema...? Determining what should and should not live in your storage? What format it should be in, etc.?

That's not a data lake in my mind -- it's just the ETL staging area.

This is why I dislike the term data lake. It's so vague and loose that it could mean 300 different things to 300 different experts. Makes it a great term for snake oil marketing though.