by fragmede 1680 days ago

> What is a data lake?

> A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

This may be self-explanatory for you, but what it means in practice is not as self-evident as you believe. As described, it could be an FTP upload directory that loads things into an SQLite database. It's not until scale is invoked (multiple terabytes per day) that the inadequacies of a naive solution become apparent. For those in that area of the industry, Snowflake is already known. (Seriously, if you're running into the limitations of Redshift, it behooves you to take a look at Snowflake.) For those who aren't, data warehousing is unfamiliar, never mind data lakes. For those outside the ML sphere, the finer points of training runs are also non-obvious.
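
To make the "upload directory into SQLite" point concrete, here is a minimal sketch of that naive version. The function name `ingest_directory`, the table name `raw_files`, and the column layout are all made up for illustration; nothing here comes from a real product.

```python
import sqlite3
from pathlib import Path


def ingest_directory(upload_dir: Path, db_path: str) -> int:
    """Load every file in upload_dir into one SQLite table, as-is.

    Hypothetical helper: no schema is imposed on the file contents,
    which is the "store your data as-is" part of the definition.
    Returns the number of files loaded.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS raw_files ("
            "name TEXT PRIMARY KEY, size_bytes INTEGER, body BLOB)"
        )
        count = 0
        for path in sorted(upload_dir.iterdir()):
            if not path.is_file():
                continue
            # Each file is read fully into memory; fine for small
            # uploads, hopeless at terabytes per day.
            data = path.read_bytes()
            conn.execute(
                "INSERT OR REPLACE INTO raw_files VALUES (?, ?, ?)",
                (path.name, len(data), data),
            )
            count += 1
        conn.commit()
        return count
    finally:
        conn.close()
```

This technically satisfies the quoted definition: structured and unstructured files land in one place without being restructured first. It also reads each file whole into memory and writes to a single local database file, which is exactly where it stops working once the volume grows.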