by MrPowers 2061 days ago
I'd argue Koalas is an anti-pattern but will have to justify that in a blog post ;)

I've written some popular Scala Spark (https://github.com/MrPowers/spark-daria) and PySpark (https://github.com/MrPowers/quinn) libraries that have been adopted by a variety of teams. Not sure how to make a reusable library with pure SQL, but it sounds like it's possible. Send me a code snippet or link if you have anything I can take a look at to learn more about your approach.

1 comment

I’m also not happy with Koalas but at least it’s a step towards API unification.

Pure SQL vs DataFrame: just write any typical join, groupby & count OLAP query once as SQL and once using the DataFrame API. I’m saying the SQL query is more accessible to non-Spark users (e.g. a DBA who might need to approve your code) and can be thrown as-is into Hive/Presto or any RDBMS pretty easily. The DataFrame version is definitely more extensible, but in my experience Spark is more often used to inform the design of a larger data pipeline than to serve as the pipeline year after year. I appreciate there are places where the opposite is true.
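
To make the comparison concrete, here's a rough sketch of what I mean. The `orders` and `customers` tables, their columns, and the function names are all made up for illustration. The SQL string runs unchanged in SQLite here (standing in for "any RDBMS"), and the same string could be passed to `spark.sql(...)`, assuming tables with those names are registered. The DataFrame function takes an existing `SparkSession` and is only meant to show the equivalent shape of the query, not anyone's production code.

```python
import sqlite3

# Hypothetical join + group by + count query, written once as plain SQL.
QUERY = """
SELECT c.country, COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country
ORDER BY c.country
"""


def run_in_sqlite():
    """Run QUERY as-is against an in-memory SQLite database with toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER, country TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
        INSERT INTO customers VALUES (1, 'US'), (2, 'DE');
        INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


def run_with_dataframe_api(spark):
    """Same query via the PySpark DataFrame API.

    Assumes `spark` is a SparkSession and that tables named `orders`
    and `customers` (hypothetical) are already registered in it.
    """
    orders = spark.table("orders")
    customers = spark.table("customers")
    return (
        orders.join(customers, on="customer_id", how="inner")
        .groupBy("country")
        .count()  # produces a column named "count"
        .withColumnRenamed("count", "order_count")
        .orderBy("country")
    )


if __name__ == "__main__":
    print(run_in_sqlite())
```

The SQL version is a single string a DBA can read and paste into another engine. The DataFrame version is easier to wrap in functions and compose, which is the extensibility trade-off.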