by lrobinovitch
1725 days ago
As I understand it, a data lake is a storage space for unstructured data. A data warehouse is a storage + compute layer, usually with data sourced from a data lake, that is ready for querying. This understanding comes from the description in this paper [1]:

> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.

[1] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf
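To make the lake vs. warehouse split a bit more concrete, here is a toy sketch of the flow the quote describes. It is only an illustration under my own assumptions: a temp directory of JSON-lines files stands in for the lake, an in-memory SQLite database stands in for the warehouse, and every name in it (`write_raw_events`, `read_with_schema`, `etl_to_warehouse`, the `sales` table) is made up for the example, not taken from the paper.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def write_raw_events(lake_dir: Path) -> Path:
    """Land raw records in the 'lake' as-is: nothing is validated on write."""
    raw = [
        {"customer": "a", "amount": "12.50", "country": "US"},
        {"customer": "b", "amount": 7},           # different type, missing field
        {"customer": "c", "note": "no amount"},   # low-quality record
    ]
    path = lake_dir / "events.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in raw))
    return path


def read_with_schema(path: Path):
    """Schema-on-read: records are interpreted only when they are consumed."""
    for line in path.read_text().splitlines():
        rec = json.loads(line)
        if "amount" not in rec:
            continue  # the data-quality problem surfaces here, downstream
        yield rec["customer"], float(rec["amount"])


def etl_to_warehouse(path: Path) -> sqlite3.Connection:
    """ETL a cleaned subset of the lake into a typed, queryable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", read_with_schema(path))
    conn.commit()
    return conn


def demo() -> tuple:
    """Run the whole flow and return (row count, total amount)."""
    with tempfile.TemporaryDirectory() as d:
        conn = etl_to_warehouse(write_raw_events(Path(d)))
        row = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
        conn.close()
        return row


if __name__ == "__main__":
    print(demo())
```

The lake accepts all three records without complaint, and only the two that survive the read-time check end up in the warehouse table. That is roughly what "punted the problem of data quality and governance downstream" looks like in miniature.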