Hacker News new | ask | show | jobs
by MrPowers 2065 days ago
Here's the Scala Spark style guide: https://github.com/MrPowers/spark-style-guide

The chispa README also provides a lot of useful info on how to properly write PySpark code: https://github.com/MrPowers/chispa

Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun.

Some specific notes:

> Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice

Manual selects ensure column pruning is performed (column pruning only works for columnar file formats like Parquet). Spark does this automatically and always manually selecting may not be practical. Explicitly pruning columns is required for Pandas and Dask.

> Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches

When performing joins, the first thing to think about is if a broadcast join is possible. Joins on clusters are hard. Then it's good to think about using a data stores that allows for predicate pushdown aggregations.

1 comments

Agree that they missed broadcast joins, which can greatly impact how you’d go about a query versus plain SQL for big data. One of the best parts about Spark is how it supports rapid iteration—- you can use it to discover what joins are computationally infeasible.

It’s notable that in Spark 3.x, Koalas is standard, which adopts the Pandas API. Yet this style guide uses the Spark DataFrame API. So the guide might be a little stale anyways.

In my experience, it’s helpful to write queries in plain portable (or mostly portable) SQL, because once a Spark job becomes useful it often gets translated or refactored into something else. Definitely depends on the team / context, but plain SQL is often more widely accessible. For fast-moving data science stuff, it’s important to think about extensibility in terms of not just code (style & syntax) but people (who is going to remix this?).

I'd argue Koalas is an anti-pattern but will have to justify that in a blog post ;)

I've written some popular Scala Spark (https://github.com/MrPowers/spark-daria) and PySpark (https://github.com/MrPowers/quinn) libraries that have been adopted by a variety of teams. Not sure how to make a reusable with pure SQL, but sounds like it's possible. Send me a code snippet or link if you have anything I can take a look at to learn more about your approach.

I’m also not happy with Koalas but at least it’s a step towards API unification.

Pure SQL vs DataFrame— just write any typical join, groupby & count OLAP query as SQL and again using the DataFrame API. I’m saying the SQL query is more accessible to non-Spark users (e.g. a DBA who might need to approve your code) and as-is can be thrown into Hive/Presto or any RDBMS pretty easily. The DataFrame version is definitely more extensible, but in my experience Spark is more often used to inform the design of a larger data pipeline versus serve as the pipeline year after year. Appreciate there are places where the opposite is true.