by epdlxjmonad 2225 days ago
There is a common belief that SparkSQL is better than Hive because SparkSQL uses in-memory computing while Hive is disk-based. Another common belief is that Presto is better than Hive because it is based on an MPP design and was invented in the early 2010s, for the very purpose of overcoming the slow speed of Hive, by the very company (Facebook) that invented Hive.

The reality is that nowadays both SparkSQL and Presto are way behind Hive, in terms of both speed and maturity. Hive has made tremendous progress since 2015 (with the introduction of LLAP), while SparkSQL still has stability issues with fault tolerance and shuffling. (Presto does not support fault tolerance.) So, IMO, SparkSQL is nowhere near ready to replace Hive.

If you are curious about the performance of these systems, see [1] and [2], which compare Hive, SparkSQL, and Presto. Disclaimer: We are developing MR3, which is mentioned in the articles. However, we tried to make a fair comparison in the performance evaluation.

[1] https://mr3.postech.ac.kr/blog/2019/11/07/sparksql2.3.2-0.10...
[2] https://mr3.postech.ac.kr/blog/2019/08/22/comparison-presto3...

3 comments

We have never been able to make Hive LLAP run reliably on our HDP cluster; queries sometimes just hang for no apparent reason.

On the other hand, our Presto cluster runs pretty much anything we throw at it, and when it fails, the failures are easier to anticipate and mitigate. It's also quite simple to deploy and operate.

Could you expand more on the reasons why Hive is faster than Spark? Aren't Hive joins also achieved via a MapReduce shuffle?
Query plans are heavily optimized, and map-side joins are used extensively. These memory-exploiting optimizations make the so-called in-memory computing of Spark no longer a differentiator, because Hive also uses memory efficiently. The Hive community is actively working toward Hive 4, so I guess the future version will be even faster.
Spark also has a query plan optimizer, and uses map-side joins (referred to as broadcast joins) whenever it makes sense. I'm just curious: what other architectural differences, in your opinion, could result in a performance discrepancy?
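
For readers unfamiliar with the term, here is a minimal sketch of the idea behind a map-side (broadcast) join, in plain Python rather than Hive or Spark code. The function name, the table layout, and the sample rows are all hypothetical; the point is only that the small table is hashed once and every partition of the big table is joined locally, so the big table is never shuffled.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_table, key):
    """Inner-join every partition of a big table against an in-memory copy of a small table.

    Illustrative only: a real engine would ship the hash table to each task.
    """
    # Build the hash table once from the small side.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    joined = []
    for partition in big_partitions:   # each partition is handled independently
        for row in partition:          # rows of the big table never change partition
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical sample data: two partitions of "orders", one small "customers" table.
orders = [
    [{"cust_id": 1, "amount": 30}, {"cust_id": 2, "amount": 15}],
    [{"cust_id": 1, "amount": 5}, {"cust_id": 3, "amount": 99}],
]
customers = [{"cust_id": 1, "name": "ann"}, {"cust_id": 2, "name": "bob"}]

result = broadcast_hash_join(orders, customers, "cust_id")
for row in result:
    print(row)
```

The trade-off is memory: this only works when the small side fits in each task's memory, which is why both engines fall back to a shuffle-based join otherwise.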
While I cannot give a definitive answer because I am not an expert on Spark internals, my opinion is that the discrepancy results mainly from query and runtime optimization.

Apart from adding new features (e.g., ACID support), a lot of effort is still put into optimizing queries and runtime. In essence, Hive is a tool specialized for SQL, so it tries to implement all the optimizations you can think of in the context of executing SQL queries. For example, Hive implements vectorized execution, whose counterpart in Spark was implemented only in a later version (with the introduction of Tungsten, IIRC). Hive even supports query re-execution: if a query fails with a fatal error like OOM, Hive generates a new query after analyzing the runtime statistics collected up to that point. The second query usually runs much faster, and you can also update the column statistics in Metastore.
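
The re-execution idea can be sketched in a few lines of plain Python. This is a toy model and not Hive's implementation: `plan_query`, `execute`, the row thresholds, and the `OutOfMemory` exception are all assumptions made up for illustration. It only shows the control flow described above: run with an estimated plan, collect runtime statistics, and re-plan once if the first attempt dies.

```python
class OutOfMemory(Exception):
    """Stand-in for a fatal runtime error such as OOM."""


def plan_query(stats):
    # Without runtime statistics the planner falls back to a (possibly wrong) estimate.
    estimated_rows = stats.get("dim_rows", 1_000)
    return "map_join" if estimated_rows <= 10_000 else "shuffle_join"


def execute(plan, actual_rows, memory_limit_rows, stats):
    # Record what was actually observed while running the query.
    stats["dim_rows"] = actual_rows
    if plan == "map_join" and actual_rows > memory_limit_rows:
        raise OutOfMemory(f"hash table of {actual_rows} rows does not fit")
    return f"ok via {plan}"


def run_with_reexecution(actual_rows, memory_limit_rows=10_000):
    stats = {}
    attempts = []
    for _ in range(2):  # the original attempt plus one re-execution
        plan = plan_query(stats)
        attempts.append(plan)
        try:
            return execute(plan, actual_rows, memory_limit_rows, stats), attempts
        except OutOfMemory:
            continue  # re-plan using the statistics gathered before the failure
    raise RuntimeError("query failed even after re-planning")


outcome, tried = run_with_reexecution(actual_rows=5_000_000)
print(outcome, tried)
```

In this toy run the first plan is chosen from a bad estimate and fails, and the second plan is chosen from the observed row count.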

In contrast, Spark is a general-purpose execution engine where SparkSQL is just an application. I remember someone comparing Spark to a Swiss Army knife, which enables you to do a lot of things easily, but is no match for specialized tools developed for a particular task. My (opinionated) guess is that SparkSQL will be replaced by Hive and Presto, and Spark Streaming will be replaced by Flink.

Was SparkSQL ever intended to replace Hive? My impression was that it was supposed to supplement Spark for times when it was convenient. I kind of suspected at one point they got caught up in the SQL-on-Hadoop race, but I always felt like it was best to do SQL elsewhere, and save Spark for things that couldn't be easily expressed in SQL.
The original SparkSQL was pretty much modelled after the Hive flavour of SQL, down to the available UDFs. The compatibility was never complete and has somewhat diverged again with successive releases of the two frameworks, but for the most part, Hive was the big data framework to beat at the time (2015-2016), and not everyone wanted to write Scala.

I think that maintaining that compatibility is now less of a need for Spark, and Hive has introduced a lot of goodies in the meantime, so there might not be a need for the SQL flavors to be in lockstep anymore.

The result of a SQL query can be used as a DataFrame, or registered as a Hive temp view that can be called from other SQL. That gives you the flexibility to mix and match SQL and programmatic logic within the same Spark app.
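
As a rough illustration of that mix-and-match pattern, here is a sketch that uses Python's built-in `sqlite3` module as a stand-in for a Spark session, so it is not Spark code and none of the names below come from Spark. The table names, column names, and the 500 ms threshold are hypothetical. The flow is the same in spirit: SQL produces rows, ordinary code transforms them, and the result is exposed through a temp view that later SQL can query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT, ms INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 120), ("bob", 950), ("ann", 80), ("bob", 1100)],
)

# Step 1: SQL produces a result set that the program treats as plain data.
rows = conn.execute(
    "SELECT user_name, AVG(ms) FROM events GROUP BY user_name ORDER BY user_name"
).fetchall()


# Step 2: programmatic logic that would be awkward to keep in SQL.
def label(avg_ms):
    return "slow" if avg_ms > 500 else "fast"


labelled = [(user_name, label(avg_ms)) for user_name, avg_ms in rows]

# Step 3: hand the result back to SQL and wrap it in a temp view.
conn.execute("CREATE TEMP TABLE user_speed (user_name TEXT, speed TEXT)")
conn.executemany("INSERT INTO user_speed VALUES (?, ?)", labelled)
conn.execute(
    """
    CREATE TEMP VIEW slow_events AS
    SELECT e.user_name, e.ms
    FROM events e JOIN user_speed s ON e.user_name = s.user_name
    WHERE s.speed = 'slow'
    """
)

slow = conn.execute("SELECT user_name, ms FROM slow_events ORDER BY ms").fetchall()
print(slow)
```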