Hacker News new | ask | show | jobs
by joefkelley 2337 days ago
It's been about 4 years since I've been in this world, but I remember there being several products all doing a very similar thing: Presto, Hive, SparkSQL, Impala, perhaps some more I'm forgetting. Is the situation still the same? Or has Presto "won out" in any sense?
1 comments

Hive and Impala are databases.

Presto and SparkSQL are SQL interfaces to many different datasources, including Hive and Impala, but also any SQL database such as Postgres/Redis/etc, and many other types of databases, such as Cassandra and Redis; the SQL tools can query all these different types of databases with a unified SQL interface, and even do joins across them.

The difference between Presto and SparkSQL is that Presto is run on a multi-tenant cluster with automatic resource allocation. SparkSQL jobs tend to have to be allocated with a specific resource allocation ahead of time. This makes Presto is (in my experience) a little more user-friendly. On the other hand, SparkSQL has better support for writing data to different datasources, whereas Presto pretty much only supports collecting results from a client or writing data into Hive.

I think some of this might be misinformed.

I know Hive can definitely query other datasources like traditional SQL databases, redis, cassandra, hbase, elasticsearch, etc, etc. I thought Impala had some bit of support for this as well, though I'm less familiar with it.

And SparkSQL can be run on a multi-tenant cluster with automatic resource allocation - Mesos, YARN, or Kubernetes.