by em500 2062 days ago
I worked quite a lot in pandas, dplyr, data.table and pyspark for a few years. And even occasionally some scala spark and sparkR. But after getting a bit fed up with F.lit()-this, F.col()-that, and the umpteenth variation on SQL, nowadays I pretty much just stick with plain SQL. I believe I've found my Enlightenment.

I have the opposite experience. After trying pyspark functional pipelines (so many handy functions), plain SQL seems so hard to read and understand. The main problem is that the order of execution does not match the order of the code lines. https://i.stack.imgur.com/6YuwE.jpg
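
To make the ordering point concrete, here is a rough sketch using only Python's built-in sqlite3 module rather than pyspark. The `events` table, its columns, and the values are made up for illustration. The query is written SELECT-first, but logically it is evaluated FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, whereas the step-by-step version below it reads top to bottom in the order it runs.

```python
import sqlite3

# Hypothetical toy data: (group, value) pairs invented for this example.
rows = [("a", 1), ("a", 5), ("b", 2), ("b", 2), ("c", 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (grp TEXT, val INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Written order:  SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY.
# Logical order:  FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
sql_result = conn.execute(
    """
    SELECT grp, SUM(val) AS total
    FROM events
    WHERE val > 1
    GROUP BY grp
    HAVING SUM(val) > 4
    ORDER BY total DESC
    """
).fetchall()

# The same logic as a pipeline, one step per line, in execution order.
filtered = [(g, v) for g, v in rows if v > 1]          # WHERE
totals = {}
for g, v in filtered:                                   # GROUP BY + SUM
    totals[g] = totals.get(g, 0) + v
kept = [(g, t) for g, t in totals.items() if t > 4]     # HAVING
pipeline_result = sorted(kept, key=lambda gt: gt[1], reverse=True)  # ORDER BY

print(sql_result)
print(pipeline_result)
```

Both versions should give the same rows; the difference is only in how closely the text on the page follows the order of evaluation.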

Another thing is that Python is so cool for data processing, and when working with plain SQL I miss being able to write things like:

    .rdd.map(my_python_processing_function)
Same for me. Python and Scala let users break up the logic into DataFrame transformations that can be unit tested, packaged into Wheel / JAR files, and easily reused in multiple contexts. Maintaining big, complex SQL codebases isn't easy.
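
As a minimal sketch of the unit-testing point: the per-row logic can live in a plain Python function that is tested without a Spark cluster. The function body, field names, and threshold below are hypothetical, picked only for illustration; the name `my_python_processing_function` is borrowed from the snippet above.

```python
def my_python_processing_function(row: dict) -> dict:
    """Hypothetical per-row logic: normalise a name and flag large amounts."""
    return {
        "name": row["name"].strip().lower(),
        "amount": row["amount"],
        "is_large": row["amount"] >= 100,  # threshold is an arbitrary assumption
    }


def test_my_python_processing_function():
    # Plain assertion; no SparkSession is needed to exercise the logic.
    out = my_python_processing_function({"name": "  Alice ", "amount": 150})
    assert out == {"name": "alice", "amount": 150, "is_large": True}


test_my_python_processing_function()

# Assuming pyspark is installed and `df` is an existing DataFrame with
# `name` and `amount` columns, the same function could be applied per row:
#   df.rdd.map(lambda r: my_python_processing_function(r.asDict()))
```

The same function can then be packaged and reused from several jobs, which is harder to do with a fragment of a SQL string.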
Could not agree more. Similar to the Principle of Least Privilege [1], I prefer to use SQL over pyspark if possible. [1] https://us-cert.cisa.gov/bsi/articles/knowledge/principles/l...