| HN Mirror



	by em500 2109 days ago
	I worked quite a lot in pandas, dplyr, data.table and pyspark for a few years. And even occasionally some scala spark and sparkR. But after getting a bit fed up with F.lit()-this, F.col()-that, and the umpteenth variation on SQL, nowadays I pretty much just stick with plain SQL. I believe I've found my Enlightenment.

2 comments

tandav 2109 days ago

I have the opposite experience. After trying pyspark functional pipelines (so many handy functions), plain SQL seems so hard to read/understand. The main problem is that the order of execution does not match the order of the code lines. https://i.stack.imgur.com/6YuwE.jpg
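
To make the ordering point concrete, here is a rough sketch using Python's built-in sqlite3 module rather than Spark, so it runs anywhere. The `orders` table and its rows are made up for illustration. The query is written SELECT-first, but logically it is evaluated as FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, which is the order the step-by-step version below it follows.

```python
import sqlite3

# Hypothetical data, used only for illustration.
rows = [("a", 10), ("a", 5), ("b", 1), ("b", 2), ("c", 50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# Written order: SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY.
sql_result = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 1
    GROUP BY customer
    HAVING SUM(amount) > 5
    ORDER BY total DESC
    """
).fetchall()

# The same logic, written in the order it is logically evaluated.
filtered = [(c, a) for c, a in rows if a > 1]        # WHERE
totals = {}
for c, a in filtered:                                # GROUP BY + SUM
    totals[c] = totals.get(c, 0) + a
kept = [(c, t) for c, t in totals.items() if t > 5]  # HAVING
pipeline_result = sorted(kept, key=lambda r: r[1], reverse=True)  # ORDER BY

print(sql_result == pipeline_result)
```

A chained DataFrame pipeline reads top to bottom like the second half, which is roughly the readability argument being made here.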

Another thing is that Python is so cool for data processing, and when working with plain SQL I feel the lack of

    .rdd.map(my_python_processing_function)


MrPowers 2108 days ago

Same for me. Python and Scala let users break up the logic into DataFrame transformations that can be unit tested, packaged into Wheel / JAR files, and easily reused in multiple contexts. Maintaining big, complex SQL codebases isn't easy.
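
A minimal sketch of that pattern, with one big assumption: it uses plain lists of dicts instead of Spark DataFrames so it runs without a Spark installation. All function and field names here are hypothetical. In PySpark the same shape would be functions that take and return a DataFrame, chained with `DataFrame.transform`.

```python
from functools import reduce


def only_adults(rows):
    """Keep rows whose age is at least 18."""
    return [r for r in rows if r["age"] >= 18]


def with_full_name(rows):
    """Return copies of the rows with a derived full_name field."""
    return [{**r, "full_name": f"{r['first']} {r['last']}"} for r in rows]


def run_pipeline(rows, *steps):
    """Apply each transformation in order, feeding each result to the next."""
    return reduce(lambda acc, step: step(acc), steps, rows)


# Each transformation can be tested on its own with a tiny input.
def test_only_adults():
    rows = [{"age": 17}, {"age": 18}]
    assert only_adults(rows) == [{"age": 18}]


def test_with_full_name():
    rows = [{"first": "Ada", "last": "Lovelace", "age": 36}]
    assert with_full_name(rows)[0]["full_name"] == "Ada Lovelace"


if __name__ == "__main__":
    test_only_adults()
    test_with_full_name()
```

Because each step is an ordinary function, it can live in a package, be imported by several jobs, and be covered by small tests, which is much harder to do with a fragment of a long SQL statement.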


iblaine 2108 days ago

Could not agree more. Similar to the Principle of Least Privilege [1], I prefer to use SQL over pyspark if possible. [1] https://us-cert.cisa.gov/bsi/articles/knowledge/principles/l...
