Hacker News new | ask | show | jobs
by dzderic 4780 days ago
What kind of performance are you getting, and how many nodes do you have?
1 comments

We have a 14 node cluster, the nodes have anywhere between 4-6 disks. Performance has been pretty amazing, we can do ad-hoc queries on this 4.5B row table. Each node has read throughput at about ~1.3GB/s for full table scans (data is snappy compressed, store as RCFile: columnar).
That sounds pretty fantastic - when you say "ad-hoc" do you mean that it's fast enough to be directly queried from a UI - are we talking seconds or minutes for your queries?

What drawbacks have you found with Impala? I've been keeping an eye on it, and also Shark: http://shark.cs.berkeley.edu/

It depends, if you plan to scan our entire data set it could take 30-40 seconds (roughly ~2.8TB), but we have our data partitioned based on a key that makes sense for the kind of data you'd need to populate a web page and these queries are fast enough (< 2 seconds) for aggregations that come in via AJAX.

We haven't yet had a chance to optimize our environment either. For example, our nodes are still running a pretty old version of CentOS, so we have LLVM disabled (which would help a lot for huge batch computations...see http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala...).

Also, our data is stored in RCFile, which is not exactly the most optimized columnar storage format. We're working on a plan to get everything over the new Parquet (http://parquet.io/) columnar format for another boost in performance.

We haven't come across any real drawbacks using Impala as of yet, it fits our needs pretty well.

Disclaimer: I work for Cloudera in their internal Tools Team, we like to dog food our stuff :).

Edit: One drawback of Impala is the lack of UDF support, but this is something that will be coming in a later release.