by zenjzen 4380 days ago
I knew I smelled me some Cloudera... :)

HBase:

I think HBase (based on the sorting of qualifiers within rows) would be suited to the "ranking" problem, which is why I brought it up. I see this as a map-only job (possibly suited to streaming, or to not using Hadoop at all). As I envision it, it would just be a quick scan/filter/pagination followed by a quick ranking algo in some sort of API middle layer.
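A rough sketch of that middle layer, in plain Python with no HBase client involved. Everything here is an assumption for illustration: `rows` stands in for a scanner that yields `(row_key, value)` pairs already sorted by key, and `scan_page` and `rank` are hypothetical helper names.

```python
from itertools import islice

def scan_page(rows, start_key, page_size, predicate):
    """Scan forward from start_key, keep rows that pass the filter,
    and stop after page_size matches (one "page").

    `rows` is a stand-in for a sorted scanner: an iterable of
    (row_key, value) pairs in ascending row_key order.
    """
    matching = ((k, v) for k, v in rows if k >= start_key and predicate(v))
    return list(islice(matching, page_size))

def rank(page, score):
    """Rank one page of results, highest score first."""
    return sorted(page, key=lambda kv: score(kv[1]), reverse=True)

# Toy data in place of a real table.
rows = [("a", 1), ("b", 5), ("c", 2), ("d", 8), ("e", 3)]
page = scan_page(rows, start_key="b", page_size=3, predicate=lambda v: v > 1)
ranked = rank(page, score=lambda v: v)
```

The point is only that the scan, filter, and pagination are a single forward pass over sorted keys, and the ranking touches one page at a time, so nothing here needs a reduce step.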

Impala:

I started using Impala around the 1.2.(don't remember) version, which was at the tail end of CDH4. I found that minor increments (for instance from 1.2.1 to 1.2.2) would change query behavior and results. We were also using Impala with its HBase connectivity, which I found to be very poor and about 100x slower than Hive+HBase. If I wanted parallelism in my queries against HBase tables, I had to run one query per region, bounded by that region's row keys, and combine them with some sort of "union all", which would parallelize the query and improve performance. Honestly, I'd consider dropping HBase from Impala until it can be made more stable and consistent with what you'd expect from SQL queries. Some of the results from Impala + HBase didn't make any sense (it's just a storage engine for Impala, right?), particularly around joins and null handling. If I created the same tables with the same data as Parquet (or even in MySQL) and ran the same queries, Parquet and MySQL would agree, but Impala+HBase would diverge.
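To make the "union all" workaround concrete, here is a minimal sketch that builds one range-bounded SELECT per region and glues them together. The table name, key column, and split keys are made up, and `region_ranges` and `union_all_query` are hypothetical helpers, not part of Impala or HBase. Keys are interpolated without escaping, so treat it as illustration only.

```python
def region_ranges(split_keys):
    """Turn sorted region split keys into (start, end) row-key ranges.

    None marks an open end: the first region has no start key and
    the last region has no end key.
    """
    bounds = [None] + list(split_keys) + [None]
    return list(zip(bounds[:-1], bounds[1:]))

def union_all_query(table, key_col, split_keys):
    """Build one SELECT per region range and join them with UNION ALL."""
    parts = []
    for start, end in region_ranges(split_keys):
        conds = []
        if start is not None:
            conds.append(f"{key_col} >= '{start}'")
        if end is not None:
            conds.append(f"{key_col} < '{end}'")
        where = f" WHERE {' AND '.join(conds)}" if conds else ""
        parts.append(f"SELECT * FROM {table}{where}")
    return "\nUNION ALL\n".join(parts)

# Hypothetical table with three regions split at 'g' and 'n'.
print(union_all_query("events", "row_key", ["g", "n"]))
```

Each branch covers exactly one region's key range, and the ranges are half-open, so no row is read twice.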

I think that Impala really kicks ass for ad hoc and infrequently run queries, but if you have a lot of concurrent queries, I don't think it handles the load very well (compared with something like Vertica). Perhaps this could be improved upon? We'd love to replace Vertica, and it seems that the only other product in its class is Impala.

I tried to use Parquet, but Parquet is really only suitable for bulk loads (not trickle loading). I was impressed with Parquet's query speed, but I had hard requirements preventing me from doing bulk loads. Impala+Parquet does deliver real-time queries/results, but the data can't be put in there in real-time, so I think this deserves a little asterisk.

BTWs:

BTW #1, do you have any metrics/data for the newer HBase (0.96.1.1+) and table scans? I find that I can table scan pretty well with a POC I put together on EC2. I can scan ~3 bn records (about 500m rows) per hour on an 8-node (7 active) cluster, each node with 30.5 GB RAM and 800 GB SSD (i2.xl) on EC2. The company I'm currently at may be taking up some serious HBase. After pre-splitting my regions and disabling region splitting, I was able to keep it very stable under concurrent reads and writes, without batching mutations. Before I disabled splitting, I was having a split/compaction storm that kept downing HBase. I use snappy compression on all CFs and row-level bloom filters.
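For scale, a quick back-of-envelope conversion of those scan numbers into per-second rates. The inputs are the approximate figures quoted above; the variable names are just for illustration.

```python
# Approximate figures from the POC: ~3 bn records (~500m rows) per hour,
# scanned by 7 active nodes.
records_per_hour = 3_000_000_000
rows_per_hour = 500_000_000
active_nodes = 7

records_per_sec = records_per_hour / 3600              # cluster-wide rate
records_per_sec_per_node = records_per_sec / active_nodes
records_per_row = records_per_hour / rows_per_hour     # average row width

print(f"{records_per_sec:,.0f} records/s cluster-wide")
print(f"{records_per_sec_per_node:,.0f} records/s per active node")
print(f"{records_per_row:.0f} records per row on average")
```

That works out to roughly 833k records/s across the cluster, or about 119k records/s per active node, at about 6 records per row.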

BTW #2, your Cloudera retargeting for ads for me is wasting your money. We're already under the belt of Cloudera-paying customers. Just an FYI. :)

BTW #3, if you put "kill -9"s (this may just be CDH 4-specific) into the GC on certain Cloudera-infused services (like HBase region servers), it would be nice if we could turn them off. Sometimes I don't mind some GC, but region servers getting a "kill -9" one after another just causes a cascade of badness.

Please don't think I'm shitting on you. I love Cloudera. As far as the Hadoop ecosystem goes, Cloudera is my _only_ choice. I cringe when people say MapR (very pushy inside sales, pain to install) or HortonWorks (too young). I've been using Hadoop since 2007, if it matters.

1 comments

Yes, the HBase scanners in Impala are not very fast, and we know that. This is an area that needs improvement to maximize parallelism, but as of right now there are a bunch of things on the Impala roadmap that take priority (disk-based aggregations/joins, window functions, nested data, and order by without limit, to name a few).

As for Parquet, that file format is not designed for streaming. As you mentioned, it's meant for converting large datasets that you plan on running analytics on. Queries against data in Parquet are _fast_, like really fast... I've seen queries go from 200 seconds down to 5 seconds just by converting the dataset from text to Parquet.

Concurrency in Impala is actually pretty good, and has been a design goal from the beginning. I wouldn't compare Impala to Vertica or other analytical databases just yet, since there's still a lot of room for improvement, but concurrency in Impala is much better than in the other SQL-on-Hadoop engines (Hive, Presto, etc.), and we've demonstrated that in our latest round of benchmarks.

BTW #1 - As I mentioned, HBase support in Impala is pretty minimal at the moment, but it still works fine for ad hoc queries over small key spaces.

BTW #2 - hehe! I'll let our marketing team know :P

BTW #3 - I'm not sure what you mean here. I'll ask around to see if someone knows.

Thanks for your kind words! Sorry for the late response, I just saw that I had a response :)