by bbrunner 4378 days ago
Redshift is an especially limited SQL engine considering it doesn't support UDFs. It is wicked fast, but what you gain in speed you lose in flexibility. Current (well, February, but fairly current) benchmarks[0] place Impala and Shark (SQL on top of Spark) within reach of Redshift while pulling data from disk and, for certain workloads, on par with or faster than Redshift. This is without using a columnar file format.

Impala is impressive technology, but it does require you to run dedicated Impala daemons, as it doesn't use MapReduce under the hood. Shark is especially interesting, however, because it is fast AND built on top of Spark, so you can run raw Spark jobs, SQL queries, graph processing and ML all on the same cluster. Shark currently uses Hive to generate its query plans, but the Spark project is working on implementing its own SQL engine called Catalyst[1] that promises to be a significant improvement.

[0] https://amplab.cs.berkeley.edu/benchmark/

[1] https://spark-summit.org/talk/armbrust-catalyst-a-query-opti...

3 comments

Although I have a lot of respect for the AMPLab, they did not do their due diligence with that benchmark, for a few reasons. They didn't test using columnar storage in Hadoop (ORC / Parquet), even though Redshift uses a proprietary columnar store underneath. Also, the most complicated query they ran was a two-table join, and from what I can tell, there wasn't any concurrent workload testing.

(disclaimer: I'm a Cloudera employee):

I recommend checking out the following blog, not because my employer wrote it, but because the guys behind the benchmark did an incredible job making the benchmark competitive. They also show metrics that a lot of the other people are not showing, for example concurrent workload capabilities, CPU efficiency, etc.

Impala, Hive (on Tez), Shark, Presto

http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the...

Impala does not currently support SerDes, last I checked, which limits its usage in certain cases. And I would not treat any benchmark too seriously, since every vendor probably only knows how to tune, or is only willing to tune, its own products. Check the latest Spark SQL benchmark: http://databricks.com/blog/2014/06/02/exciting-performance-i...

You pay a significant resource penalty when using SerDes, and since performance is one of the biggest priorities for the Impala team, we decided to leave this out for now. A very common workaround is to use Hive to generate Parquet data from your custom data (using SerDes), and then use Impala for querying the Parquet data.
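
As a rough sketch of that workaround, the statements involved could look like the ones assembled below. The table names are hypothetical, and this only builds the SQL strings; it does not connect to Hive or Impala. It assumes a Hive version that accepts `STORED AS PARQUET` in a CTAS, and that the new table is made visible to Impala with `INVALIDATE METADATA`.

```python
from typing import List


def parquet_workaround_statements(source_table: str, parquet_table: str) -> List[str]:
    """Return, in order, the statements for the Hive-then-Impala workaround.

    source_table is assumed to be an existing Hive table backed by a custom SerDe.
    """
    return [
        # Run in Hive: read through the SerDe once and rewrite the rows as Parquet.
        f"CREATE TABLE {parquet_table} STORED AS PARQUET AS SELECT * FROM {source_table}",
        # Run in Impala: pick up the table that was created outside Impala.
        f"INVALIDATE METADATA {parquet_table}",
        # Run in Impala: query the Parquet copy, with no SerDe involved.
        f"SELECT COUNT(*) FROM {parquet_table}",
    ]


if __name__ == "__main__":
    for statement in parquet_workaround_statements("raw_events", "events_parquet"):
        print(statement + ";")
```

The SerDe cost is paid once, in Hive, at conversion time rather than on every Impala query.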

I disagree with your statement about not treating vendor benchmarks seriously. As the article mentions, we made an effort to make these queries run as efficiently as possible, even going so far as rewriting queries on competing engines to make them run faster. In fact, Databricks engineers assisted us in making the Shark benchmarks as good as they could possibly get. The benchmark that I linked is very thorough, and even supplies the exact queries / scripts we used to perform the tests so you can run them yourself.

Much like the Clouderan commenter, I wouldn't put a lot of stock in Berkeley's Big Data Benchmark. I reran a similar test with columnar storage and found Impala handily beats Shark. Operationally it's also much easier to deploy (provided you're on EMR or CDH). The "dedicated nodes" argument is kind of FUD: you can use LLAMA for resource sharing, and you need to colocate impalad with DataNodes to achieve decent performance anyway. So YARN, Spark and Impala can all play nice together on the same cluster.

The queries and data set Berkeley chooses are bizarre. TPC-DS or TPC-H are much more representative of real-world performance, and the differences are more pronounced as the queries get more complex.

edit: I also don't understand why the Spark team is reinventing the wheel for Spark SQL when Hive running on Tez will produce very similar query plans. The two projects are converging to the same place, but they insist on having divergent code bases ;)

Llama+Impala isn't quite ready for prime time in my experience. The biggest issue is the reliance on Impala's query size estimates to determine how many resources to request from YARN. We find that these estimates are frequently an order of magnitude or so away from reality.

Agreed, and LLAMA also doesn't support high availability at the moment (soon to be fixed). We rely heavily on up-to-date table/column statistics to accurately determine resource consumption, and unfortunately Impala doesn't currently have incremental/background stats, something that should be in the 2.0 release.
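
To put a number on "an order of magnitude away", one option is to compare the estimate with the observed peak on a log scale. This is only an illustrative sketch: the function name and the figures are made up, and nothing here talks to Impala, LLAMA or YARN.

```python
import math


def estimate_error_orders(estimated_bytes: float, actual_bytes: float) -> float:
    """Signed log10 ratio of estimate to actual.

    +1.0 means the estimate was 10x too high, -1.0 means it was 10x too low.
    """
    if estimated_bytes <= 0 or actual_bytes <= 0:
        raise ValueError("both values must be positive")
    return math.log10(estimated_bytes / actual_bytes)


if __name__ == "__main__":
    gib = 2 ** 30
    # Hypothetical query: planner estimated 20 GiB, observed peak was 2 GiB.
    print(round(estimate_error_orders(20 * gib, 2 * gib), 2))
```

Tracking this value per query would show whether the estimates skew high (wasted YARN reservations) or low (queries that outgrow what was requested).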
Actian Matrix (ParAccel) does support UDFs, but you'd have to run your own cluster on premises.