by remilouf 2623 days ago
I love SQL. But it is hard to get other DS on board when they think that 40 lines of Spark are better than a 10-line SQL query.

The only thing that worries me with SQL is having to write UDFs for, say, computing a Z-score. But maybe that's just because I have never done it? Do you have any good resources about this?


Don’t worry, I’m fighting my own battles convincing my clients (both business and DS/DEs) that this is a viable paradigm. Here’s a nice-looking z-score recipe by Silota that I just googled: http://www.silota.com/docs/recipes/sql-z-score.html
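For what it's worth, a z-score usually doesn't need a full UDF at all. Below is a minimal sketch, not taken from the linked recipe, that runs the SQL through Python's built-in sqlite3 module. The table name `measurements`, the helper `z_scores`, and the registered function `py_sqrt` are all made up for illustration; `py_sqrt` is only registered because SQLite's own sqrt() is not available in every build. It uses the population standard deviation, so swap in the sample formula if that's what you need.

```python
import math
import sqlite3

# Hypothetical names throughout: `measurements`, `z_scores`, `py_sqrt`.
Z_SCORE_SQL = """
WITH stats AS (
    SELECT AVG(value) AS mean,
           -- population variance: E[x^2] - (E[x])^2
           AVG(value * value) - AVG(value) * AVG(value) AS variance
    FROM measurements
)
SELECT m.id,
       (m.value - s.mean) / py_sqrt(s.variance) AS z_score
FROM measurements AS m
CROSS JOIN stats AS s
ORDER BY m.id
"""


def _safe_sqrt(x):
    # Pass NULL through, and clamp tiny negative rounding errors to zero.
    if x is None:
        return None
    return math.sqrt(max(x, 0.0))


def z_scores(values):
    """Return (id, z_score) rows for `values`, computed inside SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        # The only "UDF" involved: a one-argument square root.
        conn.create_function("py_sqrt", 1, _safe_sqrt)
        conn.execute(
            "CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)"
        )
        conn.executemany(
            "INSERT INTO measurements (value) VALUES (?)",
            [(float(v),) for v in values],
        )
        return conn.execute(Z_SCORE_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in z_scores([2, 4, 4, 4, 5, 5, 7, 9]):
        print(row)
```

On a database with a built-in STDDEV aggregate, the `stats` CTE collapses to a single AVG and STDDEV call and no registered function is needed. If all values are identical the variance is zero, and SQLite returns NULL for the division rather than raising an error.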
Thanks! Do you have any tips on convincing people that SQL is a good paradigm?
I'd just write out the technical architecture: define the inputs (the raw data), the outputs (matrices for training, testing, etc.), the intervals on which they are produced (usually, data scientists want the previous day's data processed into some format, plus A/B test results and such), and how you are going to implement those transformations. It's not just SQL but also the DB where that SQL runs and the orchestration (for example with Apache Airflow). The concrete ETL tasks (nodes in a processing graph) use a combination of open-source modules (usually in Python) and Bash scripts.

It takes time to get experienced in explaining and mapping these things to the domain.
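
To make the "nodes in a processing graph" idea concrete, here is a rough sketch using only the Python standard library (graphlib needs Python 3.9+). It is not Airflow and does not use Airflow's API; it just shows the shape: raw data in, a SQL transformation in the middle, a feature table out, with the run order derived from declared dependencies. Every name here (`raw_events`, `daily_features`, the task functions, `run_pipeline`) is invented for the example.

```python
import sqlite3
from graphlib import TopologicalSorter


def extract(conn):
    # Input: raw data. A real task would pull yesterday's events from a source.
    conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO raw_events VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )


def transform(conn):
    # The SQL step: aggregate raw events into one feature row per user.
    conn.execute(
        """
        CREATE TABLE daily_features AS
        SELECT user_id, COUNT(*) AS n_events, SUM(amount) AS total_amount
        FROM raw_events
        GROUP BY user_id
        """
    )


def export(conn):
    # Output: rows ready to be turned into a training/test matrix.
    return conn.execute(
        "SELECT user_id, n_events, total_amount "
        "FROM daily_features ORDER BY user_id"
    ).fetchall()


TASKS = {"extract": extract, "transform": transform, "export": export}

# Each node maps to the set of nodes that must finish before it runs.
DEPENDENCIES = {
    "extract": set(),
    "transform": {"extract"},
    "export": {"transform"},
}


def run_pipeline():
    """Run every task once, in dependency order, against one in-memory DB."""
    conn = sqlite3.connect(":memory:")
    try:
        order = list(TopologicalSorter(DEPENDENCIES).static_order())
        results = {name: TASKS[name](conn) for name in order}
        return order, results
    finally:
        conn.close()


if __name__ == "__main__":
    order, results = run_pipeline()
    print(order)
    print(results["export"])
```

An orchestrator adds scheduling, retries and monitoring on top of this, but writing the graph down in this form first makes it easier to show people what the inputs, outputs and SQL steps actually are.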