Hacker News
by monstrado 4380 days ago
Although I have a lot of respect for the AMPLab, they did not do their due diligence with that benchmark, for a few reasons. They didn't test columnar storage in Hadoop (ORC / Parquet), even though Redshift uses a proprietary columnar store underneath. Also, the most complicated query they ran was a two-table join, and from what I can tell, there wasn't any concurrent workload testing.

(Disclaimer: I'm a Cloudera employee.)

I recommend checking out the following blog post, not because my employer wrote it, but because the people behind the benchmark did an incredible job making it competitive. They also show metrics that a lot of others are not showing, for example concurrent workload capabilities and CPU efficiency.

Impala, Hive (on Tez), Shark, Presto

http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the...


Impala does not currently support SerDes, last I checked, which limits its use in certain cases. And I would not treat any benchmark too seriously, since every vendor probably only knows how to tune, or is only willing to tune, its own products. Check the latest Spark SQL benchmark: http://databricks.com/blog/2014/06/02/exciting-performance-i...
You pay a significant resource penalty when using SerDes, and since performance is one of the Impala team's biggest priorities, we decided to leave this out for now. A very common workaround is to use Hive to generate Parquet data from your custom data (using SerDes), and then use Impala to query the Parquet data.
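
To make that workaround concrete, here is a rough sketch of the two steps as statements you would run in each engine. The table names `events_raw` and `events_parquet` and the helper `parquet_workaround` are hypothetical, and it assumes `events_raw` already exists in Hive with your custom SerDe and that your Hive version supports `STORED AS PARQUET` in a CTAS. The Python only assembles the SQL strings; it does not connect to anything.

```python
def parquet_workaround(raw_table: str, parquet_table: str) -> dict:
    """Build the statements for the Hive-then-Impala workaround.

    Both table names are hypothetical. raw_table is assumed to be an
    existing Hive table defined with a custom SerDe.
    """
    # Step 1 (Hive): read through the SerDe once and rewrite as Parquet.
    hive_sql = (
        f"CREATE TABLE {parquet_table} STORED AS PARQUET "
        f"AS SELECT * FROM {raw_table}"
    )
    # Step 2 (Impala): pick up the new table, then query the Parquet copy.
    impala_sql = [
        f"INVALIDATE METADATA {parquet_table}",
        f"SELECT COUNT(*) FROM {parquet_table}",
    ]
    return {"hive": [hive_sql], "impala": impala_sql}


steps = parquet_workaround("events_raw", "events_parquet")
for engine, statements in steps.items():
    for sql in statements:
        print(f"[{engine}] {sql};")
```

The SerDe cost is paid once at conversion time in Hive, and Impala only ever touches the Parquet copy.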

I disagree with your statement about not treating benchmarks from vendors seriously. As the article mentions, we made an effort to make these queries run as efficiently as possible, even going so far as rewriting queries on competing engines to make them run faster. In fact, Databricks engineers assisted us in making the Shark benchmarks as good as they could possibly get. The benchmark that I linked is very thorough, and even supplies the exact queries / scripts we used to perform the tests so you can run them yourself.