by alanctgardner3
4379 days ago
Much like the Clouderan commenter, I wouldn't put a lot of stock in Berkeley's Big Data Benchmark. I reran a similar test with columnar storage and found that Impala handily beats Shark. Operationally, it's also much easier to deploy (provided you're on EMR or CDH). The "dedicated nodes" argument is kind of FUD: you can use LLAMA for resource sharing, and you need to colocate impalad with the DataNodes to achieve decent performance anyway. So YARN, Spark and Impala can all play nice together on the same cluster.

The queries and data set Berkeley chose are bizarre. TPC-DS or TPC-H are much more representative of real-world performance, and the differences become more pronounced as the queries get more complex.

Edit: I also don't understand why the Spark team is reinventing the wheel with Spark SQL when Hive running on Tez will produce very similar query plans. The two projects are converging on the same place, but they insist on having divergent code bases ;)