| Here's the Scala Spark style guide: https://github.com/MrPowers/spark-style-guide The chispa README also provides a lot of useful info on how to properly write PySpark code: https://github.com/MrPowers/chispa Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun. Some specific notes: > Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice Manual selects ensure column pruning is performed (column pruning only works for columnar file formats like Parquet). Spark does this automatically and always manually selecting may not be practical. Explicitly pruning columns is required for Pandas and Dask. > Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches When performing joins, the first thing to think about is if a broadcast join is possible. Joins on clusters are hard. Then it's good to think about using a data stores that allows for predicate pushdown aggregations. |
It’s notable that in Spark 3.x, Koalas is standard, which adopts the Pandas API. Yet this style guide uses the Spark DataFrame API. So the guide might be a little stale anyways.
In my experience, it’s helpful to write queries in plain portable (or mostly portable) SQL, because once a Spark job becomes useful it often gets translated or refactored into something else. Definitely depends on the team / context, but plain SQL is often more widely accessible. For fast-moving data science stuff, it’s important to think about extensibility in terms of not just code (style & syntax) but people (who is going to remix this?).