I’m also not happy with Koalas but at least it’s a step towards API unification.

Pure SQL vs DataFrame: just write any typical join, groupby & count OLAP query as SQL and again using the DataFrame API. I’m saying the SQL query is more accessible to non-Spark users (e.g. a DBA who might need to approve your code) and can be thrown into Hive/Presto or any RDBMS pretty much as-is. The DataFrame version is definitely more extensible, but in my experience Spark is more often used to inform the design of a larger data pipeline than to serve as the pipeline year after year. I appreciate there are places where the opposite is true.
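To make the comparison concrete, here is a rough sketch of the same join, groupby & count written both ways. The `orders` and `customers` tables, their columns, and the function names are all made up for illustration. The SQL half runs on SQLite from the standard library just to show the string is portable; the DataFrame half assumes PySpark is installed and that you pass in two Spark DataFrames with matching columns.

```python
import sqlite3

# Hypothetical tables and columns, invented purely for illustration.
QUERY = """
SELECT c.country, COUNT(*) AS n_orders
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.country
ORDER BY n_orders DESC, c.country
"""


def run_sql_sqlite():
    """Run QUERY unchanged on an in-memory SQLite database."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customers (customer_id INTEGER, country TEXT);
        CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
        INSERT INTO customers VALUES (1, 'DE'), (2, 'US'), (3, 'US');
        INSERT INTO orders VALUES (10, 1), (11, 2), (12, 2), (13, 3);
    """)
    rows = con.execute(QUERY).fetchall()
    con.close()
    return rows


def orders_per_country_df(orders, customers):
    """Same query via the PySpark DataFrame API (inputs are Spark DataFrames)."""
    # Imported lazily so the SQL half above runs without Spark installed.
    from pyspark.sql import functions as F

    return (
        orders.join(customers, on="customer_id", how="inner")
        .groupBy("country")
        .agg(F.count("*").alias("n_orders"))
        .orderBy(F.desc("n_orders"), "country")
    )


if __name__ == "__main__":
    print(run_sql_sqlite())
```

On the Spark side, the same `QUERY` string should also work via `spark.sql(QUERY)` once both DataFrames are registered with `createOrReplaceTempView`, which is the portability point. The DataFrame version is the one you can wrap in functions, parameterize and compose, which is the extensibility point.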