| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by choppaface 2064 days ago

Agree that they missed broadcast joins, which can greatly impact how you’d go about a query versus plain SQL for big data. One of the best parts about Spark is how it supports rapid iteration—- you can use it to discover what joins are computationally infeasible.

It’s notable that in Spark 3.x, Koalas is standard, which adopts the Pandas API. Yet this style guide uses the Spark DataFrame API. So the guide might be a little stale anyways.

In my experience, it’s helpful to write queries in plain portable (or mostly portable) SQL, because once a Spark job becomes useful it often gets translated or refactored into something else. Definitely depends on the team / context, but plain SQL is often more widely accessible. For fast-moving data science stuff, it’s important to think about extensibility in terms of not just code (style & syntax) but people (who is going to remix this?).

1 comments

MrPowers 2064 days ago

I'd argue Koalas is an anti-pattern but will have to justify that in a blog post ;)

I've written some popular Scala Spark (https://github.com/MrPowers/spark-daria) and PySpark (https://github.com/MrPowers/quinn) libraries that have been adopted by a variety of teams. Not sure how to make a reusable with pure SQL, but sounds like it's possible. Send me a code snippet or link if you have anything I can take a look at to learn more about your approach.

link

choppaface 2062 days ago

I’m also not happy with Koalas but at least it’s a step towards API unification.

Pure SQL vs DataFrame— just write any typical join, groupby & count OLAP query as SQL and again using the DataFrame API. I’m saying the SQL query is more accessible to non-Spark users (e.g. a DBA who might need to approve your code) and as-is can be thrown into Hive/Presto or any RDBMS pretty easily. The DataFrame version is definitely more extensible, but in my experience Spark is more often used to inform the design of a larger data pipeline versus serve as the pipeline year after year. Appreciate there are places where the opposite is true.

link