by barrkel 3383 days ago

Relational algebra is a really useful model for thinking about ETL tasks in general. SQL is an awkward dialect for expressing relational algebra, but it is at least a well-known one, and reasonably portable for a subset of querying. You can see the payoff in the Hadoop ecosystem too: Hive with HQL, spark-sql, Impala - SQL being used to express a data flow graph with a bunch of relational operators.

When you program directly against Spark, you're effectively building SQL plans explicitly. It's more indirect - instead of writing a program that does stuff, you write a program that creates a data flow graph that does stuff - and you have more responsibility for performance, for good and bad.
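To make the "program that creates a data flow graph" idea concrete, here's a toy sketch in plain Python. It is not Spark's API; `Plan`, `scan`, `explain` and `run` are names made up for illustration. The point is only that calling `filter` and `map` builds up a plan, and nothing touches the data until `run()` is called.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass(frozen=True)
class Plan:
    """A node in a toy data flow graph (hypothetical, not a Spark class)."""
    op: str
    parent: Optional["Plan"] = None
    fn: Optional[Callable] = None
    source: Optional[tuple] = None

    def filter(self, pred: Callable) -> "Plan":
        # Only records the operator; no rows are examined here.
        return Plan("filter", parent=self, fn=pred)

    def map(self, fn: Callable) -> "Plan":
        # Likewise, just adds a node to the graph.
        return Plan("map", parent=self, fn=fn)

    def explain(self) -> str:
        # Walk back to the source and print the operator chain.
        chain = []
        node = self
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return " -> ".join(reversed(chain))

    def run(self) -> Iterator:
        # Execution happens only here, by walking the graph.
        if self.op == "scan":
            return iter(self.source)
        rows = self.parent.run()
        if self.op == "filter":
            return (r for r in rows if self.fn(r))
        if self.op == "map":
            return (self.fn(r) for r in rows)
        raise ValueError(f"unknown operator: {self.op}")


def scan(rows) -> Plan:
    """Leaf node wrapping some in-memory rows (stand-in for a real data source)."""
    return Plan("scan", source=tuple(rows))


# Made-up sample rows, purely for illustration.
plan = (
    scan([{"user": "a", "bytes": 10}, {"user": "b", "bytes": 500}, {"user": "c", "bytes": 900}])
    .filter(lambda r: r["bytes"] > 100)
    .map(lambda r: r["user"])
)

print(plan.explain())
print(list(plan.run()))
```

A real engine gets to inspect and rearrange that graph before running it, which is where both the indirection and the performance responsibility come from.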

I think to get good performance, you simply can't think on a per-item basis. You need to orient your thinking towards what can be efficiently performed at the bulk level. Whether it's column scanning in HDFS, or index scanning in an RDBMS, you need to be aware of the engineering properties of the operators you're applying. Doing lots of things per-item is a recipe for blowing your budgets, whether it's cache, memory, I/O, whatever. You want to iteratively do a little work to lots of items, and then join, rather than lots of work to each item one at a time.
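A small sketch of the per-item vs bulk contrast, using Python's built-in `sqlite3`. The `orders` and `customers` tables and their rows are made up for illustration. Both functions compute the same totals; the first issues one lookup query per order, the second hands the whole job to a single join and aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
""")
# Hypothetical sample data.
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "eu"), (2, "us"), (3, "eu")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 2, 20), (3, 1, 5), (4, 3, 7)])


def totals_per_item(conn):
    """Per-item style: one extra query for every order row."""
    totals = {}
    orders = conn.execute("SELECT customer_id, amount FROM orders").fetchall()
    for customer_id, amount in orders:
        (region,) = conn.execute(
            "SELECT region FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        totals[region] = totals.get(region, 0) + amount
    return totals


def totals_bulk(conn):
    """Bulk style: one join plus aggregate, planned and executed by the engine."""
    rows = conn.execute("""
        SELECT c.region, SUM(o.amount)
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        GROUP BY c.region
    """)
    return dict(rows)


print(totals_per_item(conn))
print(totals_bulk(conn))
```

With four rows the difference is invisible, but the per-item version does a number of queries proportional to the number of orders, while the bulk version lets the engine pick how to scan and join.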