by choppaface
2064 days ago
Agree that they missed broadcast joins, which can greatly change how you’d approach a query compared with plain SQL for big data. One of the best parts about Spark is how it supports rapid iteration: you can use it to discover which joins are computationally infeasible. It’s notable that in Spark 3.x, Koalas, which adopts the pandas API, is standard. Yet this style guide uses the Spark DataFrame API, so the guide might be a little stale anyway. In my experience, it’s helpful to write queries in plain portable (or mostly portable) SQL, because once a Spark job becomes useful it often gets translated or refactored into something else. It definitely depends on the team and context, but plain SQL is often more widely accessible. For fast-moving data science work, it’s important to think about extensibility in terms of not just code (style and syntax) but also people (who is going to remix this?).

I've written some popular Scala Spark (https://github.com/MrPowers/spark-daria) and PySpark (https://github.com/MrPowers/quinn) libraries that have been adopted by a variety of teams. I'm not sure how to make a reusable library with pure SQL, but it sounds like it's possible. Send me a code snippet or link if you have anything I can look at to learn more about your approach.