Great start! If you keep at it, I'd love to see more of the advanced stuff. I feel like we're all hitting problems like skew, and it would be cool to have a reference for dealing with those.
Hey quadrature, thanks for the feedback! Could you go into more detail about what kind of skew you're seeing? :)
In Chapter 7 I go into some methods of fixing skewed data when performing joins. This solved the majority of our skew problems, but I believe we still see skew on aggregates. I am working on how to debug/find skew in a Spark application in Chapter 6; I wanted to get this initial release out since I've been procrastinating on it for over 2 years lol.
We have also done more Spark parameter optimization, but that helps once the data skew has been resolved.
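For readers who haven't run into this: the thread doesn't say which methods Chapter 7 uses, so the following is only a sketch of one common approach to skewed joins, key salting, written in plain Python rather than Spark so it runs anywhere. The names `NUM_SALTS`, `salt_large_side`, `explode_small_side`, and `hash_join`, as well as the toy data, are all made up for illustration.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical number of salt buckets; tune per workload


def salt_large_side(rows, num_salts, rng):
    """Attach a random salt to each key on the skewed (large) side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key can match."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Plain in-memory inner join on the first tuple element."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


rng = random.Random(0)
# Toy data: one "hot" key dominates the large side.
events = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
users = [("hot", "alice"), ("cold", "bob")]

salted_events = salt_large_side(events, NUM_SALTS, rng)
salted_users = explode_small_side(users, NUM_SALTS)

# Join on (key, salt), then drop the salt to recover the original key.
joined = [
    (key, lv, rv)
    for (key, _salt), lv, rv in hash_join(salted_events, salted_users)
]
plain = hash_join(events, users)

# Rows per join key before and after salting: the hot key is now split up.
sizes_before = Counter(key for key, _ in events)
sizes_after = Counter(key for key, _ in salted_events)
```

The salted join returns the same rows as the plain one, but the hot key's rows are spread over several join keys, which in a distributed engine would typically mean several partitions instead of one. The cost is that the small side is replicated `NUM_SALTS` times.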
I’ve given a very introductory talk about what Arrow “gives for free” when using the right kind of UDF. It’s more fun in person, but with the references at the end and the presenter notes, I think you can get an idea of what you’ll want to mention faster than by researching it from scratch. It’s [here](https://github.com/rberenguel/pyspark-arrow-pandas). I hope you find it useful!