by zenjzen 4378 days ago
Completely agree with everything you said. I disagree about Impala, right now.

It has great potential, but I don't think it's prod-ready yet.

Also, why no mention of HBase?


Although Impala is still a fairly new product, my team has been using it internally at Cloudera in production for over a year to provide real-time log analysis to our support engineers (http://bit.ly/USFQdh), among other ad-hoc BI analytics. We also have a bunch of customers who are using Impala to power very critical interactive workloads. What about Impala makes you feel like it's not production-ready?

Good question about HBase. I didn't mention it because, although it's a super fast NoSQL database, it's a lousy analytical database. Sure, it's great at doing really fast scans over small slices of data (and/or updating data), but full table scans are extremely slow compared to analyzing flat files in HDFS. For doing analytics in Hadoop, the format you almost always want is Parquet. Not only is reading files directly from HDFS faster, but Parquet is a true columnar store, so you pay a minimal IO penalty for queries since you only read the columns you need. Also, Parquet uses some really efficient column encodings (dictionary, delta, run length, ...) that both reduce IO and increase the effectiveness of compression.
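To make the encoding point concrete, here is a minimal sketch of dictionary, run-length, and delta encoding applied to a single column. It is a toy illustration under my own assumptions, not Parquet's on-disk format (real Parquet uses a hybrid RLE/bit-packing scheme and a packed delta encoding); the function names and sample data are made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next free code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


def delta_encode(numbers):
    """Keep the first number, then only the differences between neighbours."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


# Hypothetical sample columns: a low-cardinality string and a timestamp.
status = ["ok", "ok", "ok", "error", "ok", "ok"]
timestamps = [1000, 1001, 1003, 1006]

dictionary, codes = dictionary_encode(status)
runs = run_length_encode(codes)
deltas = delta_encode(timestamps)
```

Because values in one column tend to repeat or change slowly, the encoded forms are small integers and short runs, which is why a general-purpose compressor does better on them than on row-oriented text.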

I knew I smelled me some Cloudera... :)

HBase:

I think HBase (based on the sorting of qualifiers within rows) would be suited toward the "ranking" problem, that's why I brought it up. I see this as being a map-only job (and possibly suited toward streaming, or not even using Hadoop at all). It would just be a quick scan/filter/pagination and then a quick ranking algo in some sort of API middle layer (how I envision this).

Impala:

I started using Impala around the 1.2.(don't remember) version, which was at the tail end of CDH4. I found that even minimal increments (for instance from 1.2.1 to 1.2.2) would change query behavior and results. We were also using Impala with its HBase connectivity, which I found to be very poor and about 100x slower than Hive+HBase. If I wanted parallelism in my queries against HBase tables, I had to run a query between the start and end row keys of each region and combine them with some sort of "union all", which would parallelize the query and increase performance. Honestly, I'd consider dropping HBase from Impala until it can be made more stable and consistent with what you'd expect from SQL queries. Some of the results from Impala+HBase didn't make any sense (it's just a storage engine for Impala, right?), particularly around joins and null handling. If I created the same tables with the same data in Parquet (or even MySQL) and ran the same queries, Parquet and MySQL would agree, but Impala+HBase would diverge.
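For reference, the per-region "union all" workaround can be sketched roughly like this. It only builds a SQL string; the table name, key column, and split keys are hypothetical, and the keys are interpolated without escaping, so treat it as an illustration rather than something to run against untrusted input.

```python
def region_range_queries(table, key_column, split_keys, select_list="*"):
    """Build one SELECT per key range delimited by the region split keys."""
    # None marks an open-ended boundary: before the first split, after the last.
    bounds = [None] + list(split_keys) + [None]
    queries = []
    for lo, hi in zip(bounds, bounds[1:]):
        predicates = []
        if lo is not None:
            predicates.append(f"{key_column} >= '{lo}'")
        if hi is not None:
            predicates.append(f"{key_column} < '{hi}'")
        where = f" WHERE {' AND '.join(predicates)}" if predicates else ""
        queries.append(f"SELECT {select_list} FROM {table}{where}")
    return queries


def union_all(queries):
    """Join the per-range queries so each range can be scanned independently."""
    return "\nUNION ALL\n".join(queries)


# Hypothetical table pre-split into three regions at row keys 'g' and 'n'.
per_region = region_range_queries("events", "row_key", ["g", "n"])
sql = union_all(per_region)
```

Each branch covers exactly one region's key range, so no row is read twice and the ranges together cover the whole table.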

I think that Impala really kicks ass for ad-hoc and infrequently run queries, but if you have a lot of concurrent queries, I don't think it handles the load very well (compared with something like Vertica). Perhaps this could be improved upon? We'd love to replace Vertica, and it seems that the only other product in its class is Impala.

I tried to use Parquet, but Parquet is really only suitable for bulk loads (not trickle loading). I was impressed with Parquet's query speed, but I had hard requirements preventing me from doing bulk loads. Impala+Parquet does deliver real-time queries/results, but the data can't be put in there in real-time, so I think this deserves a little asterisk.

BTWs:

BTW #1, do you have any metrics/data for the newer HBase (0.96.1.1+) and table scans? I find that I can table scan pretty well with a POC I put together on EC2. I can scan ~3 bn records (about 500m rows) per hour on an 8-node (7 active) cluster with 30.5 GB RAM and 800 GB SSD per node (i2.xlarge) on EC2. The company I'm currently at may be taking up some serious HBase. After pre-splitting my regions and disabling region splitting, I was able to keep it very stable under concurrent reads and writes, without batching mutations. Before I disabled splitting, I was having a split/compaction storm that kept downing HBase. I use Snappy compression on all CFs and row-level bloom filters.

BTW #2, your Cloudera ad retargeting is wasting your money on me. We're already a paying Cloudera customer. Just an FYI. :)

BTW #3, if you put "kill -9"s (this may just be CDH 4-specific) into the GC handling on certain Cloudera-infused services (like HBase region servers), it would be nice if we could turn that off. Sometimes I don't mind some GC, but one region server after another getting a "kill -9" just causes a cascade of badness.
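For anyone comparing against the scan numbers in BTW #1, here is the back-of-the-envelope arithmetic as a small sketch. It takes the quoted "records" and "rows" figures at face value and assumes the load is spread evenly over the 7 active nodes, which the original numbers don't confirm.

```python
SECONDS_PER_HOUR = 3600

# Figures quoted in BTW #1 (approximate).
RECORDS_PER_HOUR = 3_000_000_000
ROWS_PER_HOUR = 500_000_000
ACTIVE_NODES = 7


def scan_rates(records_per_hour, rows_per_hour, active_nodes):
    """Convert hourly scan totals into per-second and per-node rates."""
    records_per_sec = records_per_hour / SECONDS_PER_HOUR
    return {
        "records_per_sec": records_per_sec,
        "rows_per_sec": rows_per_hour / SECONDS_PER_HOUR,
        # Assumes an even spread across active nodes.
        "records_per_sec_per_node": records_per_sec / active_nodes,
        "records_per_row": records_per_hour / rows_per_hour,
    }


rates = scan_rates(RECORDS_PER_HOUR, ROWS_PER_HOUR, ACTIVE_NODES)
for name, value in rates.items():
    print(f"{name}: {value:,.0f}")
```

That works out to somewhere over 800 thousand records per second cluster-wide, or a bit under 120 thousand per active node.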

Please don't think I'm shitting on you. I love Cloudera. As far as the Hadoop ecosystem goes, Cloudera is my _only_ choice. I cringe when people say MapR (very pushy inside sales, pain to install) or Hortonworks (too young). I've been using Hadoop since 2007, if it matters.

Yes, the HBase scanners in Impala are not very fast, and we know that. This is an area that needs improvement to maximize parallelism, but as of right now there are a bunch of things on the Impala roadmap that take priority: disk-based aggregations/joins, window functions, nested data, and order by without limit, to name a few.

As for Parquet, that file format is not designed for streaming. Like you mentioned, it's meant for converting large datasets that you plan on running analytics on. Queries against data in Parquet are _fast_, like really fast... I've seen queries go from 200 seconds down to 5 seconds just by converting the dataset from text to Parquet.

Concurrency in Impala is actually pretty good, and it has been a design goal from the beginning. I wouldn't compare Impala to Vertica or other analytical databases just yet, since there's still a lot of room for improvement, but concurrency in Impala is much better than in the other SQL-on-Hadoop engines (Hive, Presto, etc.), and we've demonstrated that in our latest rounds of benchmarks.

BTW #1 - As I mentioned, HBase support in Impala is pretty minimal at the moment, but it still works fine for ad-hoc queries over small key spaces.

BTW #2 - hehe! I'll let our marketing team know :P

BTW #3 - I'm not sure what you mean here, I'll ask around to see if someone knows.

Thanks for your kind words! Sorry for the late response, I just saw that I had a response :)