Show HN: A Hands-On Guide on PySpark Coding and Best Practices (github.com)
52 points by ericxiao251 2681 days ago
3 comments

Great start. If you keep at it, I'd love to see more of the advanced stuff. I feel like we're all hitting problems like skew, and it would be cool to have a reference for dealing with those.
Hey quadrature, thanks for the feedback! Would you be able to go into more detail about the kind of skew you see :)?

In chapter 7 I go into some methods of fixing skewed data when performing joins. This solved the majority of our skew problems, but I believe we still see skew on aggregates. I am working on how to debug and find skew in a Spark application in chapter 6, but I wanted to release this first, as I've been procrastinating on it for over 2 years lol.

We have also done more Spark parameter optimization, but that helps once the data skew has been resolved.
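
For readers who have not run into join skew before, here is a minimal sketch of one common remedy, key salting. It is not necessarily the method used in chapter 7, and it is plain Python rather than PySpark so that it runs anywhere: `partition_of`, `NUM_PARTITIONS`, `NUM_SALTS`, and the `hot_user` data are all made up for illustration, with `zlib.crc32` standing in for a hash partitioner. In PySpark the same idea is usually expressed by adding a random salt column to the skewed side and replicating the other side once per salt value.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical partition count
NUM_SALTS = 8       # hypothetical number of salt buckets


def partition_of(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_sizes(keys):
    """Count how many rows land in each partition."""
    return Counter(partition_of(k) for k in keys)


# Made-up skewed data: one hot key dominates the join column.
rng = random.Random(0)
keys = ["hot_user"] * 9_000 + [f"user_{i}" for i in range(1_000)]

# Without salting, every "hot_user" row hashes to the same partition.
plain = partition_sizes(keys)

# With salting, each row gets a random suffix, so the hot key is split
# into up to NUM_SALTS distinct keys that can land in different partitions.
salted_keys = [f"{k}#{rng.randrange(NUM_SALTS)}" for k in keys]
salted = partition_sizes(salted_keys)

# The other side of the join has to be replicated once per salt value,
# otherwise the salted keys would no longer find their match.
small_side = {"hot_user": "premium"}
small_side_salted = {
    f"{k}#{s}": v for k, v in small_side.items() for s in range(NUM_SALTS)
}

print("largest partition, plain: ", max(plain.values()))
print("largest partition, salted:", max(salted.values()))
```

The trade-off is that the replicated side grows by a factor of `NUM_SALTS`, so salting tends to make sense only when that side is small or when only the known hot keys are salted.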

I didn't find Apache Arrow in this repo. I would like to learn more about your experience with using Arrow, the performance improvements, and any lessons learned.
I haven't looked into or kept up with Arrow much, but if I see fit, I can add more about it :)!
I’ve given a very introductory talk about what Arrow “gives for free” when using the right kind of UDF. It’s more fun in person, but with the references at the end and the presenter notes I think you could get an idea of what you will want to mention quicker than having to look at it from scratch. It’s [here](https://github.com/rberenguel/pyspark-arrow-pandas), I hope you find it useful!
Oh awesome, thanks for the resources! I will definitely see how I can incorporate it into my guide :).
This is great and much needed! Looking forward to chapter 6. Wish I had the other chapters when I was getting started with Spark.
Yes I agree, that was the whole premise of the repo 2 years ago!

I'm glad you like it :)!