| HN Mirror



	by loveparade 1043 days ago
	I guess it's surprising then that both Hadoop/Hive and Spark, which were the originators of SQL for ETL, typically work on data lakes instead of RDBMSs. In fact, RDBMS support didn't come for a long time. The choice of SQL has nothing to do with RDBMSs. It's because SQL is a declarative language that's easy to parse and convert into a physical query plan that can be parallelized and optimized extremely well. Why is that? Because it's not a general-purpose imperative loosely typed brittle language like Python.
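
To make the "declarative statement becomes a physical plan" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is a single-node RDBMS, not Hive or Spark, so this only illustrates the general idea: the same SQL text gets a different plan once an index exists. The table, index, and helper names are made up for the example.

```python
import sqlite3

def explain_plan(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # 'detail' is the last column

conn = sqlite3.connect(":memory:")
# Hypothetical table, used only for illustration.
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")

# The query says WHAT to compute, not HOW to find the rows.
query = "SELECT SUM(amount) FROM events WHERE user_id = 42"

plan_without_index = explain_plan(conn, query)

# Adding an index changes the physical plan; the SQL text stays the same.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_with_index = explain_plan(conn, query)

print(plan_without_index)
print(plan_with_index)
```

The exact plan wording depends on the SQLite version, but the first plan should be a table scan and the second an index search.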

1 comments

dragonwriter 1043 days ago

> Hadoop/Hive and Spark, which were the originators of SQL for ETL

They weren’t.

I guarantee you, before either of those existed, when Data Warehousing was often done with a different version/configuration of the same brand of RDBMS as the transactional store (the latter likely using something closer to a normalized schema, the former using a star or snowflake schema), using SQL for ETL was absolutely normal.

Which is why newer data warehousing / data lake systems support SQL even though they aren’t RDBMSs: a couple decades of RDBMS dominance made it the JavaScript of data storage.

> Because it’s not a general-purpose imperative loosely typed brittle language like Python.

It’s not general-purpose or imperative, but it’s just as much “loosely typed” as Python (both Python and SQL are strongly typed).

It’s not clear what concrete meaning “brittle” is supposed to have in this claim, so I can’t evaluate its accuracy.
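
On the typing point, a quick sketch of what "strongly typed" means on the Python side: the language is dynamically typed, but it will not silently coerce an int and a str to make an operation work. The helper name is hypothetical, and this only demonstrates the Python half of the claim.

```python
def add_without_coercion(a, b):
    """Try a + b and report the TypeError instead of coercing operands."""
    try:
        return a + b
    except TypeError as exc:
        return f"TypeError: {exc}"

# Same-typed operands work as expected.
ok = add_without_coercion(1, 2)

# Mixed int/str operands are rejected rather than implicitly converted.
rejected = add_without_coercion(1, "1")

print(ok)
print(rejected)
```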

Alanhlwang 1043 days ago

Definitely, I can jump into what we meant by brittle. We mainly meant that SQL scripts are hard to debug and undescriptive: you can't parametrize or customize the error messages you receive from transforms, and you can only execute one complete statement at a time, with steps often chained together as CTEs (which is a nightmare if it's a 400-line SQL statement). Python makes debugging easier because it turns the approach from a declarative one into a procedural one, and you can even use breakpoints when you write your actual transformers in Python.
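
A rough sketch of the procedural style described above, assuming transforms are plain Python functions applied row by row. Every name here (`TransformError`, `parse_amount`, `require_positive`, `run_pipeline`) is hypothetical and not taken from any particular tool; the point is only that each step can raise its own parametrized error message and can be stepped through individually.

```python
class TransformError(Exception):
    """Hypothetical error type naming the failing step and the offending row."""

def parse_amount(row):
    # Convert the 'amount' field to float, with a step-specific error message.
    try:
        return {**row, "amount": float(row["amount"])}
    except (KeyError, ValueError) as exc:
        raise TransformError(f"parse_amount failed on row {row!r}: {exc}") from exc

def require_positive(row):
    # Reject rows whose amount is zero or negative.
    if row["amount"] <= 0:
        raise TransformError(f"require_positive: non-positive amount in row {row!r}")
    return row

def run_pipeline(rows, steps):
    out = []
    for row in rows:
        for step in steps:
            # A breakpoint() placed here would expose every intermediate value.
            row = step(row)
        out.append(row)
    return out

rows = [{"id": 1, "amount": "9.50"}, {"id": 2, "amount": "oops"}]
try:
    run_pipeline(rows, [parse_amount, require_positive])
except TransformError as exc:
    message = str(exc)
    print(message)
```

The error names the step and shows the row that failed, which is the kind of customization a single chained SQL statement makes harder.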