by bbrunner
4378 days ago
Redshift is an especially limited SQL engine considering it doesn't support UDFs. It is wicked fast, but what you gain in speed you lose in flexibility. Current (well, February, but fairly current) benchmarks[0] place Impala and Shark (SQL on top of Spark) within reach of Redshift when pulling data from disk and, for certain workloads, on par with or faster than Redshift. This is without using a columnar file format. Impala is impressive technology, but it does require you to run dedicated Impala daemons, as it doesn't use MapReduce under the hood. Shark is especially interesting, however, because it is fast AND built on top of Spark, so you can run raw Spark jobs, SQL queries, graph processing and ML all on the same cluster. Shark currently uses Hive to generate its query plans, but the Spark project is working on implementing its own SQL engine called Catalyst[1] that promises to be a significant improvement.

[0] https://amplab.cs.berkeley.edu/benchmark/
[1] https://spark-summit.org/talk/armbrust-catalyst-a-query-opti...
(Disclaimer: I'm a Cloudera employee.)
I recommend checking out the following blog post, not because my employer wrote it, but because the people behind the benchmark did an incredible job of making it competitive. They also report metrics that a lot of other benchmarks leave out, for example concurrent workload capabilities and CPU efficiency.
Impala, Hive (on Tez), Shark, Presto
http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the...