Hacker News
by alanctgardner3 4379 days ago
Much like the Clouderan commenter, I wouldn't put a lot of stock in Berkeley's Big Data Benchmark. I reran a similar test with columnar storage and found Impala handily beats Shark. Operationally it's also much easier to deploy (provided you're on EMR or CDH). The "dedicated nodes" argument is kind of FUD: you can use LLAMA for resource sharing, and you need to colocate impalad with DataNodes to achieve decent performance anyway. So YARN, Spark and Impala can all play nice together on the same cluster.

The queries and data set Berkeley chose are bizarre. TPC-DS and TPC-H are much more representative of real-world performance, and the differences become more pronounced as the queries get more complex.

edit: I also don't understand why the Spark team is reinventing the wheel for Spark SQL when Hive running on Tez will produce very similar query plans. The two projects are converging to the same place, but they insist on having divergent code bases ;)


LLAMA+Impala isn't quite ready for prime time in my experience. The biggest issue is the reliance on Impala's query size estimates to determine how many resources to request from YARN. We find that these estimates are frequently an order of magnitude or so away from reality.

Agreed, and also LLAMA doesn't support high availability at the moment (soon to be fixed). We rely heavily on up-to-date table/column statistics in order to accurately determine resource consumption, and unfortunately Impala doesn't currently have incremental/background stats, something that should be in the 2.0 release.