
It depends. If you plan to scan our entire data set (roughly 2.8 TB), it could take 30-40 seconds. But our data is partitioned on a key that makes sense for the kind of data you'd need to populate a web page, and those queries are fast enough (< 2 seconds) for aggregations that come in via AJAX.
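To illustrate the partitioning point, here's a toy Python sketch. It is not Impala code: the `PartitionedTable` class, the `customer_id` key, and the sample rows are all made up for illustration. It only shows why a query that filters on the partition key has to touch one partition instead of the whole data set.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy in-memory table whose rows are grouped by a partition key (hypothetical)."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self.partitions = defaultdict(list)  # key value -> rows in that partition
        self.rows_scanned = 0  # counter so the two scan strategies can be compared

    def insert(self, row):
        self.partitions[row[self.partition_key]].append(row)

    def sum_full_scan(self, column, key_value):
        """Aggregate by reading every partition and filtering row by row."""
        total = 0
        for rows in self.partitions.values():
            for row in rows:
                self.rows_scanned += 1
                if row[self.partition_key] == key_value:
                    total += row[column]
        return total

    def sum_pruned(self, column, key_value):
        """Aggregate by reading only the partition that matches the key."""
        total = 0
        for row in self.partitions.get(key_value, []):
            self.rows_scanned += 1
            total += row[column]
        return total


def build_sample_table():
    # 100 made-up customers with 50 rows each (assumed data, for illustration only).
    table = PartitionedTable("customer_id")
    for customer_id in range(100):
        for i in range(50):
            table.insert({"customer_id": customer_id, "bytes": i})
    return table
```

With this made-up data, both methods return the same sum for a single customer, but the full scan reads 5,000 rows while the pruned version reads 50. That's the same idea, at toy scale, behind the gap between a 30-40 second full scan and a sub-2-second partitioned query.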

We haven't had a chance to optimize our environment yet, either. For example, our nodes are still running a pretty old version of CentOS, so we have LLVM disabled (which would help a lot for huge batch computations; see http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala...).

Also, our data is stored in RCFile, which is not exactly the most optimized columnar storage format. We're working on a plan to get everything over to the new Parquet (http://parquet.io/) columnar format for another boost in performance.

We haven't come across any real drawbacks using Impala so far; it fits our needs pretty well.

Disclaimer: I work for Cloudera on their internal Tools Team. We like to dogfood our stuff :).

Edit: One drawback of Impala is the lack of UDF support, but this is something that will be coming in a later release.