| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rwhaling 1716 days ago

Always an interesting read. A ton of insights, but I always find it a bit hard to make sense of, and it feels so disconnected from a modern distributed systems context.

In retrospect, I think a lot of the Spark SQL Dataframe workflow comes pretty close to what D/Tutorial D aspire to - static typing, functions on relations, imperative style but fundamentally unordered, etc.; however, it's only a processing system, not a storage system.

I have kept my distance from the "data lake" buzzword circles, but maybe a transactional, Spark-based data lake does approximate what Darwen/Date are going for? The only thing really missing might be nested relations?

2 comments

darksaints 1716 days ago

I haven't read it but I understand where you're coming from here.

Does this doc talk about the problems with nullability / ternary logic? What about algebraic sum types? Those have always been some of the most difficult aspects of relational data modeling, at least with respect to SQL.

link

overkalix 1716 days ago

Again, what's the difference between a data warehouse and a data lake? Honest question.

link

lrobinovitch 1716 days ago

As I understand it, a data lake is a storage space for unstructured data. A data warehouse is a storage + compute layer, usually with data sourced from a data lake, that is ready for querying. This understanding comes from the description in this paper[1]

> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.

[1] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

link

overkalix 1716 days ago

Reasonable enough I guess. Thanks!

link

cdcarter 1716 days ago

In my experience, a data warehouse usually has an ETL process at the beginning. Data comes in from disparate sources and on a regular basis, it is ETLd into a shape that is ready to use by the business. On the other hand, a data lake slurps in all the data as soon as it is available, in whatever form it is in. You have to process it into the business-consumable form when you query/egress it, but you don't have to know your dream schema up front.

link

lunatuna 1716 days ago

My experience is similar: extract process -> raw data -> clean/merge -> model

Normally you extract from source, then load to destination. There is no business logic in this process.

From raw you do all of your transforms to get clean up and merge and then get it into a usable model. With big data sets I've done wtih Hadoop and then moved the clean/merged data to a standard or MPP DB for analysts. For normal sets this can all be done in a standard DB.

The other part is all the data is available from raw and clean/merge for analysts to use and is kept. With the thinking the storage cost are extremely low and heading to zero. Whereas in traditional DW analysts used only the modeled sets and depending on the data earlier sets are deleted as they are for operational purposes only. Storage is considered expensive and limiting.

The move to ELT and using a declarative dataops tool has been mind bending and has been a multiplier in terms of speed to get to something usable. I don't want to see another DW again.

link

snidane 1716 days ago

Data lake: you own your files. Data warehouse: database owns your files.

link