Hacker News new | ask | show | jobs
by mremes 2623 days ago
I'll express an advocation for using SQL as a data pipelining language. Firstly, many SQL dialects are multi-platform and provide standardization for transformations. It's a declarative language that doesn't define how computation happens but what.

Where SQL is terrible to write is when one must pivot data. Each column transformation is defined separately (case whens). When the cardinality of a pivoted vector is high, it results in quite a verbose declaration. This problem can be mitigated for example by generating SQL programmatically with templating languages such as Jinja2. Rendering is handled nicely on platforms such as Airflow when running the rendered SQL in cloud (for example on top of Redshift or Presto cluster, BigQuery).

For writing complex transformations, UDFs and cascading subqueries are the way to go. Window functions are useful for scanning subsets of column values (useful for example in vector transformations [doing normalization, regularization etc.])

SQL is also a language with a gentle learning curve which makes it easy to learn for less software-engineering-minded people (BI people and analysts of different departments in a decentralized data science organization). It's established itself as a lingua franca for matrix transformations already for decades.

Data processing is usually done in batches of different intervals as in traditional data science nothing really needs real-time processing for single events. Then Spark shines. But I would rather make a tradeoff of using SQL and Spark side by side when handling real-time processing than losing benefits of using SQL that I listed above.

When data transformations – with some object ontology related to it other than "just maths" – are to be done real-time, then you better start thinking about building an application for that (using your favorite programming languages).

Even with Spark, around 70% of work is done in SparkSQL.

1 comments

I love SQL. But it hard to get other DS on board who think that 40 lines of Spark is better than a 10 line SQL query.

The only thing that worries me with SQL is when having to write UDFs for, say, computing a Z-score. But maybe it's just because I have never done it? Do you have any good resources about this?

Don’t worry, I’m having my battles convincing my clients (both business and DS/DEs) that this is a viable paradigm. Here’s a nice-looking z-value recipe by Silota that I just googled up: http://www.silota.com/docs/recipes/sql-z-score.html
Thanks! Do you have any tips on convincing people that SQL is a good paradigm?
I'd just go and write out the technical architecture, defining what are the inputs (the raw data) and what are the outputs (matrices for training, testing etc. etc.) on different intervals (usually, data scientists want the previous days' data processed into some format, A/B test results and such) and how are you going to instrument those transformations. It's not just SQL but the DB where that SQL would be run and orchestration (for example with Apache Airflow), and for concrete ETL tasks (nodes in a processing graph) using a combination of open-source modules (usually in Python) and Bash scripts.

It takes time to get experienced in explaining and mapping these things to the domain.