Hacker News new | ask | show | jobs
by slotrans 2061 days ago
I'm so confused.

These examples are all using the SQL-like features of Spark. Not a map() or flatMap() in sight.

So... why not just write SQL?

  df.registerTempTable('some_name')
  new_df = spark.sql("""select ... from some_name ...""")
All of this F.col(...) and .alias(...) and .withColumn(...) nonsense is a million times harder to read than proper SQL. I just don't understand what any of this is intended to accomplish.
2 comments

The spark.sql("""select ... from some_name ...""") bit is how to write pure SQL from the Scala or Python execution context. If you're in the SQL execution context, this syntax isn't required. I never write Spark code like this.

At first glance, the F.col(), .withColumn() syntax isn't as intuitive as pure SQL, but it has a lot of advantages when you get used to it. You can make abstractions, use programming language features like loops, and use IDEs.

I find the PySpark syntax to be uglier than Scala. Lots of teams are terrified to use Scala and that's the reason PySpark is so popular.

Because the examples use the Dataframe API and not the RDD API? At least as of Spark 2.4.7, both `Dataset.map()` and `Dataset.flatMap()` are still marked experimental.