Show HN: A Hands-On Guide on PySpark Coding and Best Practices

	Show HN: A Hands-On Guide on PySpark Coding and Best Practices (github.com)
	52 points by ericxiao251 2681 days ago

3 comments

quadrature 2681 days ago

Great start! If you keep at it, I'd love to see more of the advanced stuff. I feel like we're all hitting problems like skew, and it would be cool to have a reference for dealing with those.
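
For readers unfamiliar with the term: "skew" here means a few keys own far more rows than the rest, so the tasks handling those keys run much longer than the others. The thread does not show how anyone measures it, so the following is only a minimal plain-Python sketch of one way to spot it, assuming row counts per key are compared against the median key. The `skew_report` helper and the sample keys are hypothetical, made up for illustration.

```python
from collections import Counter
from statistics import median


def skew_report(keys, top_n=3):
    """Return (key, row_count, ratio_to_median) for the heaviest keys.

    Hypothetical helper: a ratio far above 1 suggests that key is skewed.
    """
    counts = Counter(keys)
    typical = median(counts.values())  # row count of a "typical" key
    return [(key, n, n / typical) for key, n in counts.most_common(top_n)]


# Made-up data: one hot key and three small ones.
keys = ["hot"] * 1000 + ["a"] * 10 + ["b"] * 12 + ["c"] * 8
report = skew_report(keys)

for key, rows, ratio in report:
    print(f"{key}: {rows} rows, {ratio:.1f}x the median key")
```

On a real Spark DataFrame the same idea would likely be a group-by count on the join or aggregation key, sorted in descending order.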

ericxiao251 2681 days ago

Hey quadrature, thanks for the feedback! Would you be able to go into more detail about what skew you see :)?

In chapter 7 I go into some methods of fixing skewed data when performing joins. This solved the majority of our skew problems, but I believe we still see skew on aggregates. I am working on how to debug and find skew in a Spark application in chapter 6; I wanted to release this now, as I've been procrastinating on it for over 2 years lol.

We have also done more Spark parameter tuning, but that helps after the data skew has been resolved.
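
The comment does not say which methods chapter 7 uses for skewed joins. One commonly used approach is key salting, and the sketch below illustrates it in plain Python so that it runs without a cluster. Everything in it (`NUM_SALTS`, `salt_left`, `explode_right`, `hash_join`, and the sample rows) is a hypothetical illustration and not code from the guide. The idea: add a random salt to the key on the large, skewed side, replicate each row of the small side once per salt value, then join on the composite `(key, salt)`.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical; more salts spread a hot key over more join keys


def salt_left(rows, num_salts, rng):
    """Attach a random salt to each key on the large, skewed side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_right(rows, num_salts):
    """Replicate each row of the small side once per possible salt value."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Inner join on the composite (key, salt), returning the original key."""
    index = {}
    for join_key, value in right:
        index.setdefault(join_key, []).append(value)
    return [(jk[0], lv, rv) for jk, lv in left for rv in index.get(jk, [])]


rng = random.Random(0)
events = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
labels = [("hot", "H"), ("cold", "C")]

salted_events = salt_left(events, NUM_SALTS, rng)
exploded_labels = explode_right(labels, NUM_SALTS)
joined = hash_join(salted_events, exploded_labels)

# Rows per join key after salting: the hot key is now split across several keys.
rows_per_join_key = Counter(join_key for join_key, _ in salted_events)
```

The join result is the same as an unsalted join, but no single join key carries all 1000 "hot" rows any more. The cost is that the small side grows by a factor of `NUM_SALTS`, so this trade-off only makes sense when that side is small.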

antisocial 2681 days ago

I didn't find Apache Arrow in this repo. I would like to learn more about your experience with using Arrow, the performance improvements, and any lessons learned.

ericxiao251 2681 days ago

I haven't looked into or kept up with Arrow much, but if it seems like a good fit, I can add more about it :)!

RBerenguel 2681 days ago

I’ve given a very introductory talk about what Arrow “gives for free” when using the right kind of UDF. It’s more fun in person, but with the references at the end and the presenter notes I think you could get an idea of what you will want to mention quicker than having to look at it from scratch. It’s [here](https://github.com/rberenguel/pyspark-arrow-pandas), I hope you find it useful!

ericxiao251 2681 days ago

Oh awesome, thanks for the resources! I will definitely see how I can incorporate it into my guide :).

paulgb 2681 days ago

This is great and much needed! Looking forward to chapter 6. Wish I had the other chapters when I was getting started with Spark.

ericxiao251 2681 days ago

Yes I agree, that was the whole premise of the repo 2 years ago!

I'm glad you like it :)!