Hacker News
by agacera 2222 days ago
Apache Drill is an interesting project. Of all the MPP engines that appeared a few years ago, it was the one most similar to BigQuery (the first public version) and the most flexible.

However, the competition was fierce, and each Big Data vendor (MapR, Cloudera and HortonWorks) was pushing its own solution: Drill, Impala and Hive on Tez, respectively. Competition is always a good thing, but it fragmented the user base too much, so no clear winner emerged.

At the same time, Spark SQL got sufficiently better to replace these tools in most use cases and Presto (from Facebook) got the traction and the user base that none of these projects had by being vendor agnostic (and its adoption by AWS in Athena and EMR also helped boost its popularity).

4 comments

There is a common belief that SparkSQL is better than Hive because SparkSQL uses in-memory computing while Hive is disk-based. Another common belief is that Presto is better than Hive because it is based on an MPP design and was created in the early 2010s, by the very company (Facebook) that invented Hive, for the very purpose of overcoming Hive's slow speed.

The reality is that nowadays both SparkSQL and Presto are way behind Hive in terms of both speed and maturity. Hive has made tremendous progress since 2015 (with the introduction of LLAP), while SparkSQL still has stability issues with fault tolerance and shuffling. (Presto does not support fault tolerance.) So, IMO, SparkSQL is nowhere near ready to replace Hive.

If you are curious about the performance of these systems, see [1] and [2], which compare Hive, SparkSQL, and Presto. Disclaimer: We are developing MR3, which is mentioned in the articles. However, we tried to make a fair comparison in the performance evaluation.

[1] https://mr3.postech.ac.kr/blog/2019/11/07/sparksql2.3.2-0.10...

[2] https://mr3.postech.ac.kr/blog/2019/08/22/comparison-presto3...

We have never been able to make Hive LLAP run reliably on our HDP cluster; queries sometimes just hang for no apparent reason.

On the other hand, our Presto cluster runs pretty much anything we throw at it, and when it fails, the failures are easier to anticipate and mitigate. It's also quite simple to deploy and operate.

Could you expand more on the reasons why Hive is faster than Spark? Aren't Hive joins also achieved via a MapReduce shuffle?
Query plans are heavily optimized, and map-side joins are used extensively. The use of optimizations that exploit memory makes the so-called in-memory computing of Spark no longer a differentiator, because Hive also uses memory efficiently. The Hive community is actively working toward Hive 4, so I guess the future version will be even faster.
Spark also has a query plan optimizer, and uses map-side joins (referred to as broadcast joins) whenever it makes sense. I'm just curious: what other architectural differences, in your opinion, could result in a performance discrepancy?
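
For readers unfamiliar with the term, here is a minimal plain-Python sketch of the difference between a map-side (broadcast) join and a shuffle join. It is only an illustration of the idea, not how Hive or Spark implement it; the function names, the sample tables, and the `num_buckets` parameter are all made up for this example.

```python
from collections import defaultdict

def broadcast_join(large_partitions, small_table, key):
    """Map-side join: each partition of the large table probes a copy of the small table."""
    # Build the hash table once; a real engine would ship it to every worker.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)
    joined = []
    for partition in large_partitions:   # partitions are processed independently
        for row in partition:            # rows of the large table never move
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined

def shuffle_join(left_partitions, right_partitions, key, num_buckets=4):
    """Shuffle join: both inputs are repartitioned by key before joining."""
    left_buckets, right_buckets = defaultdict(list), defaultdict(list)
    for partition in left_partitions:
        for row in partition:
            left_buckets[hash(row[key]) % num_buckets].append(row)
    for partition in right_partitions:
        for row in partition:
            right_buckets[hash(row[key]) % num_buckets].append(row)
    joined = []
    for bucket in range(num_buckets):    # matching keys now live in the same bucket
        lookup = defaultdict(list)
        for row in right_buckets[bucket]:
            lookup[row[key]].append(row)
        for row in left_buckets[bucket]:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined

# Hypothetical sample data: a partitioned "orders" table and a small "users" table.
orders = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 20}],
    [{"user_id": 1, "amount": 5}, {"user_id": 3, "amount": 7}],
]
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]

via_broadcast = broadcast_join(orders, users, "user_id")
via_shuffle = shuffle_join(orders, [users], "user_id")
```

Both functions return the same rows; the difference is that the broadcast variant copies only the small table, while the shuffle variant moves every row of both inputs.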
While I cannot give a definitive answer because I am not an expert on Spark internals, my opinion is that the discrepancy results mainly from query and runtime optimization.

Apart from adding new features (e.g., ACID support), a lot of effort is still put into optimizing queries and the runtime. In essence, Hive is a tool specialized for SQL, so it tries to implement all the optimizations you can think of in the context of executing SQL queries. For example, Hive implements vectorized execution, whose counterpart in Spark was implemented only in a later version (with the introduction of Tungsten, IIRC). Hive even supports query re-execution: if a query fails with a fatal error like OOM, Hive generates a new query after analyzing the runtime statistics collected up to that point. The second query usually runs much faster, and you can also update the column statistics in Metastore.
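
To make the re-execution idea concrete, here is a toy sketch in plain Python. It is not Hive's actual implementation: the planner, the `QueryOutOfMemory` exception, the statistics dictionary, and the memory budget are all hypothetical names invented for this illustration. The point is only the control flow: plan from (possibly stale) estimates, fail, then re-plan from the statistics observed at runtime.

```python
class QueryOutOfMemory(Exception):
    """Hypothetical fatal error that carries the statistics observed before failing."""
    def __init__(self, observed_stats):
        super().__init__("build side too large for broadcast join")
        self.observed_stats = observed_stats

def plan(stats):
    """Pick a join strategy from row-count estimates."""
    small_side = min(stats["left_rows"], stats["right_rows"])
    return "broadcast" if small_side <= stats["memory_budget_rows"] else "shuffle"

def execute(strategy, left, right, memory_budget_rows):
    """Join two lists of keys; a broadcast join fails if the small side does not fit."""
    if strategy == "broadcast" and min(len(left), len(right)) > memory_budget_rows:
        raise QueryOutOfMemory({"left_rows": len(left), "right_rows": len(right)})
    right_keys = set(right)
    return [k for k in left if k in right_keys]

def run_with_reexecution(left, right, estimated_stats):
    """Run once with the estimates; on OOM, re-plan with observed statistics and retry."""
    stats = dict(estimated_stats)
    attempts = []
    for _ in range(2):
        strategy = plan(stats)
        attempts.append(strategy)
        try:
            return execute(strategy, left, right, stats["memory_budget_rows"]), attempts
        except QueryOutOfMemory as err:
            stats.update(err.observed_stats)   # replace stale estimates with real counts
    raise RuntimeError("query failed even after re-planning")

# Stale statistics claim the right table has 10 rows; it really has 1000.
left_table = list(range(5000))
right_table = list(range(0, 2000, 2))
stale_stats = {"left_rows": 5000, "right_rows": 10, "memory_budget_rows": 100}

result, attempts = run_with_reexecution(left_table, right_table, stale_stats)
```

In this sketch the first attempt picks a broadcast join because of the stale estimate, fails, and the second attempt falls back to a shuffle join using the observed row counts.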

In contrast, Spark is a general-purpose execution engine where SparkSQL is just an application. I remember someone comparing Spark to a Swiss army knife, which enables you to do a lot of things easily, but is no match for specialized tools developed for a particular task. My (opinionated) guess is that SparkSQL will be replaced by Hive and Presto, and Spark Streaming will be replaced by Flink.

Was SparkSQL ever intended to replace Hive? My impression was that it was supposed to supplement Spark for times when it was convenient. I kind of suspected at one point they got caught up in the SQL-on-Hadoop race, but I always felt like it was best to do SQL elsewhere, and save Spark for things that couldn't be easily expressed in SQL.
The original SparkSQL was pretty much modelled after the Hive flavour of SQL, down to the available UDFs. The compatibility was never complete and has somewhat diverged again with respective releases of the frameworks, but for the most part, Hive was the big data framework to beat at the time (2015-2016), and not everyone wanted to write Scala.

I think that now, maintaining that compatibility is less of a need for Spark and Hive has introduced a lot of goodies in the meantime, so there might not be a need for the SQL flavors to be in lockstep anymore.

A SQL result can be used as a DataFrame, or as a Hive temp view that can be called from other SQL. That gives you the flexibility to mix and match SQL and programmatic logic within the same Spark app.
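
A rough sketch of that mix-and-match workflow follows. To keep it runnable without a cluster, it uses Python's built-in `sqlite3` as a stand-in for Spark, which is purely an assumption of this example; in Spark the corresponding pieces would be `spark.sql(...)` returning a DataFrame and `createOrReplaceTempView(...)`. The table, view, and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("ann", 120), ("bob", 80), ("ann", 300), ("cat", 50)],
)

# Step 1 (SQL): name an intermediate result as a temp view so later SQL can reuse it.
conn.execute("""
    CREATE TEMP VIEW per_user AS
    SELECT user_name, SUM(bytes) AS total_bytes
    FROM events
    GROUP BY user_name
""")

# Step 2 (SQL on top of SQL): query the view like any other table.
rows = conn.execute(
    "SELECT user_name, total_bytes FROM per_user "
    "WHERE total_bytes > 60 ORDER BY user_name"
).fetchall()

# Step 3 (programmatic logic): apply rules that are awkward to express in SQL.
def bucket(total_bytes):
    return "heavy" if total_bytes >= 400 else "light"

labelled = {user_name: bucket(total) for user_name, total in rows}
```

The same three-step shape (SQL, SQL over a named view, then ordinary code) is what the comment above describes for a Spark app.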
IIRC, earlier in the project, a differentiator for Drill was meant to be the ability to run drillbit processes across servers, and run distributed queries from one of them with Zookeeper as a coordinator. This would have been a simple approach to distributed queries where a secondary extract and load into a distributed filesystem or Parquet file was not desired. [1]

Unfortunately, to date, distributed queries will fail if the paths _and_ files are not symmetric in name (all file paths and names must exist on all nodes), so the "in situ" approach is not available. It appears the project focused on querying distributed file systems like HDFS and S3 and therefore had a lot of competition.

I hope some group picks up where HPE orphaned Drill after the MapR acquisition and pivots to a pure distributed worker approach. Running a drillbit on nodes where the data originates could be useful; the original example was SQL over HTTP logs directly from webservers.

[1] https://mapr.com/blog/drill-your-big-data-today-apache-drill...

I think it's important to add notes about the "Hadoop eras" in which some of these were first developed and evolved.

Hadoop 1.x (i.e. "MapReduce" execution engine):

* Apache Pig

* Apache Hive

* Apache Drill

* Cloudera Impala

If I recall correctly, neither Drill nor Impala actually used Hadoop 1.x MapReduce as the execution engine; both were mostly bundled to read data commonly stored in the HDFS cluster.

Hadoop 2.x (i.e. the MR2 / "YARN" era):

* Apache Pig

* Apache Hive

* Apache Tez (technically a substitute execution engine for MR2), built to allow containers to persist and to optimize and coalesce operation/task stages, among other things, to reduce overall job latency

* Apache Spark (technically a substitute execution engine for MR2)

Spark entered the Hadoop ecosystem, as many people were storing their data in HDFS, and the Hadoop 2 YARN resource model/containers provided the compute resources to run Spark as an execution engine, in lieu of MR2. You could and can also run a separate Spark-dedicated cluster, but many people were already running Hadoop and storing their data in HDFS clusters.

"Shark" became SparkSQL somewhere around Spark 1.3-1.4x? Schema-ed RDDs evolved into DataFrames, which better enabled people to reason about and interface with their data in a table-like manner. Python/PySpark performance also rapidly improved thanks to things like Project Tungsten and DataFrames.

https://databricks.com/blog/2015/02/17/introducing-dataframe...

Post-Hadoop / MR2:

* Hive

* Spark

* Presto

Tez was very much backed by Hortonworks, as part of their HDP Hadoop distribution, and was motivated by the need to improve the performance of the existing Apache Pig and Hive tools (major contributors came from Yahoo, Microsoft, and Hortonworks). Hortonworks later incorporated Spark as part of their distribution.

Spark was adopted by Cloudera as part of their CDH Hadoop distribution, and coexisted with Impala.

In the post-Hadoop / post-Spark world, Hortonworks and Cloudera merged as well:

https://www.cloudera.com/about/news-and-blogs/press-releases...

Also, since we're talking MPP with SQL/SQL-like dialects, we may as well mention that Greenplum and ParAccel/Redshift also coexisted with all of these.

Great post :) This reads like a History Channel documentary!
I've not spent much time, but I've never exactly understood what Presto is. Is it just map reduce across databases?
It's a distributed SQL engine that can query data from various database engines (via connectors or JDBC drivers), as well as structured file formats like CSV or Parquet (using the Hive metastore).

Presto does not manage storage itself, but instead focuses on fronting those data sources with a single access point, with the option to federate (join) different sources in a single query.

"Presto is an open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3"

TIL that Presto is available in EMR.

Not only that, but AWS Athena is basically serverless Presto. It's an extremely handy tool, particularly if you've got structured or semi-structured data being dumped into S3 and you want a near-zero-maintenance (you only have to create schemas) way to explore it.
It's basically a compute engine that maintains all state in memory and does distributed computations similarly to Spark.

The big thing it adds is that it isn't tied to any storage format. It has connectors that let you load data into it from basically anything, from MySQL DBs to HDFS files to whatever. So you can do cross-database joins and just not care about where the data lives. You can also output to almost any database too.
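
As a small-scale illustration of what a cross-database join looks like, here is a sketch using Python's built-in `sqlite3` and `ATTACH DATABASE`. This is only an analogy and an assumption of this example: Presto reaches its sources through connectors and addresses tables as `catalog.schema.table`, whereas here two independent in-memory SQLite databases stand in for two separate systems. The `crm` alias and the table contents are made up.

```python
import sqlite3

# "main" plays the role of one data source ...
conn = sqlite3.connect(":memory:")
# ... and "crm" is a second, independent database attached to the same session.
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (user_id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE crm.users (user_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO main.orders VALUES (?, ?)", [(1, 10), (2, 20), (1, 5)])
conn.executemany("INSERT INTO crm.users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# One query joins tables that live in different databases.
totals = conn.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.users AS u ON o.user_id = u.user_id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
```

The query author only names the source as a prefix on the table; where the rows physically live is the engine's problem.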

Not being snarky, but is that basically a SQL interface for map reduce across databases?
It's basically federated SQL. Nothing to do with map/reduce, really.
It loads everything from the data sources (presumably pushing as much as possible down to the underlying database) and then does SQL in RAM. Pretty much the definition of map reduce. Federated SQL wouldn't give a speedup across a single database, but Presto does.
Loading from data sources and doing SQL in ram is very much not the definition of Map Reduce.
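
For reference, the MapReduce pattern itself is a specific three-step shape: map each record to key/value pairs, group the pairs by key (the shuffle), and reduce each group. A minimal single-process sketch in plain Python, with function names and sample log lines invented for illustration:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit several (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle_phase(pairs):
    """Group all values by key; in a cluster this is the network shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)

# Example: count HTTP status codes in a few made-up log lines.
logs = ["GET /a 200", "GET /b 404", "GET /a 200"]

def status_mapper(line):
    yield line.split()[2], 1

def count_reducer(status, ones):
    return sum(ones)

counts = map_reduce(logs, status_mapper, count_reducer)
```

Nothing in this shape says where the input comes from or whether the data is held in RAM, which is why "load from sources, then compute in memory" does not by itself describe MapReduce.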
It is basically federated SQL to get all the data you need into memory.

Then it applies a bunch of techniques, one of which is map reduce, to do the computation.

It's like you took a distributed SQL server and made it never have its own storage outside of memory. Instead, it operates on the data wherever it lives.

Is there any form of distributed query engine in use today that doesn't fit the definition of the MapReduce pattern? Is describing a distributed query engine as MapReduce still a meaningful distinction from some other, non-MapReduce approach?
Sort of. The biggest difference is that it can be a pseudo-data-warehouse for analysts and data scientists over an object store (e.g., S3) without needing to manage a complicated ETL process. AWS Athena goes even further by not needing you to provision compute, so queries are run on ephemeral VMs over the object store.

Hive makes a terrible data warehouse no matter how SQL compatible it is.