Hacker News
by cozos 2224 days ago
Could you expand more on the reasons why Hive is faster than Spark? Aren't Hive joins also achieved via a MapReduce shuffle?

Query plans are heavily optimized, and map-side joins are used extensively. Because Hive also uses memory efficiently, the so-called in-memory computing of Spark is no longer a distinguishing advantage. The Hive community is actively working toward Hive 4, so I guess future versions will be even faster.
Spark also has a query plan optimizer, and uses map-side joins (referred to as broadcast joins) whenever it makes sense. I'm just curious: what other architectural differences, in your opinion, could result in a performance discrepancy?
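For readers unfamiliar with the term, here is a minimal sketch of what a map-side (broadcast) join does, in plain Python rather than Hive or Spark code. The function name `map_side_join` and the sample tables are hypothetical, chosen only for illustration: the small table is loaded into an in-memory hash table that every mapper can probe, so the large table never has to be shuffled by key.

```python
from collections import defaultdict


def map_side_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by probing an in-memory hash table.

    Illustrative only: real engines ship the small table to every worker
    and stream the large table partition by partition.
    """
    # Build phase: hash the small table once (this is the "broadcast" side).
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: stream the large table; no shuffle or sort is needed.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data.
customers = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
orders = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 3},  # no matching customer, so it is dropped
]

joined = list(map_side_join(orders, customers, "cust_id"))
```

The trade-off is memory: this only works when the small side fits in each worker's memory, which is why both engines decide on it from size estimates.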
While I cannot give a definitive answer because I am not an expert on Spark internals, my opinion is that the discrepancy results mainly from query and runtime optimization.

Apart from adding new features (e.g., ACID support), the Hive community still puts a lot of effort into optimizing queries and the runtime. In essence, Hive is a tool specialized for SQL, so it tries to implement all the optimizations you can think of in the context of executing SQL queries. For example, Hive implements vectorized execution, whose counterpart in Spark was implemented only in a later version (with the introduction of Tungsten IIRC). Hive even supports query re-execution: if a query fails with a fatal error like OOM, Hive generates a new query plan after analyzing the runtime statistics collected up to that point. The second attempt usually runs much faster, and you can also update the column statistics in Metastore.
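The re-execution idea can be sketched roughly as follows. This is a toy model in plain Python, not Hive's actual implementation: `plan_query`, `execute`, `run_with_reexecution`, and the row-count threshold are all hypothetical names and numbers, and the OOM is simulated with `MemoryError`. It only shows the control flow: plan from an estimate, fail, then re-plan from what was observed at runtime.

```python
def plan_query(estimated_rows, memory_budget_rows):
    """Hypothetical planner: choose a join strategy from a row-count estimate."""
    if estimated_rows <= memory_budget_rows:
        return "map_side_join"
    return "shuffle_join"


def execute(plan, actual_rows, memory_budget_rows, stats):
    """Hypothetical executor: records runtime statistics, may simulate an OOM."""
    stats["observed_rows"] = actual_rows  # collected before any failure
    if plan == "map_side_join" and actual_rows > memory_budget_rows:
        raise MemoryError("hash table does not fit in memory")
    return f"ok via {plan}"


def run_with_reexecution(estimated_rows, actual_rows, memory_budget_rows):
    """Run once; on a simulated OOM, re-plan using the observed statistics."""
    stats = {}
    plan = plan_query(estimated_rows, memory_budget_rows)
    try:
        return execute(plan, actual_rows, memory_budget_rows, stats), [plan]
    except MemoryError:
        # Second attempt: the estimate is replaced by the runtime statistic.
        new_plan = plan_query(stats["observed_rows"], memory_budget_rows)
        result = execute(new_plan, actual_rows, memory_budget_rows, stats)
        return result, [plan, new_plan]


# A stale estimate (1,000 rows) for a table that really has 5,000,000 rows.
result, plans = run_with_reexecution(1_000, 5_000_000, 100_000)
```

In this sketch the first plan fails because the estimate was far too low, and the second plan succeeds because it is based on the observed row count.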

In contrast, Spark is a general-purpose execution engine on which Spark SQL is just one application. I remember someone comparing Spark to a Swiss Army knife: it lets you do a lot of things easily, but it is no match for specialized tools developed for a particular task. My (opinionated) guess is that Spark SQL will be replaced by Hive and Presto, and Spark Streaming will be replaced by Flink.