Hacker News new | ask | show | jobs
by monstrado 4378 days ago
These type of articles baffle me, you're comparing a high-performance analytical database to a batch-orientated SQL engine.

The whole point behind these query engines on Hadoop (Hive, Presto, Impala, etc) is to separate the database from the query engine. With these engines you can project schemas over raw data in its original form, without having to load it into a table. With Redshift, or other similar analytical databases, you're forced to define a schema, and then load the data in row by row...bulk inserts are very slow in comparison to Hadoop technologies.

Regardless, Hive in general should nver be used for interactive analytics, that's not what it's intended for. Where Hive shines is when you can dump 250TB of raw text data into a folder and then run a SQL query to extract useful information out of it. The extracted data could then be loaded into a RDBMS like RedShift for real-time reporting.

With all that being said, if you want to run SQL queries on data in Hadoop at the speeds of Redshift, you should have used Impala with Parquet, which is known to be even faster than Redshift in many cases, and is based on the same technology Google uses (Dremel and F1). The benefits of keeping your data in Hadoop are enormous, not every problem can be solved using SQL. The same data you're querying with Impala could actually be used to do machine learning using Spark or Mahout. Maybe you want to start indexing one of your tables into Solr to provide search capabilities on a subset of your columns to your users...or maybe you want to use Giraph or Sparks' GraphX to do parallel graph computation. The data never moves, there's still only ONE copy of that data in Hadoop, and you can bring any kind of workload to it.

3 comments

So we recently did this same switch. I can say that in our experience data pulls that were taking (on a good day) 6 hours now take under 10 seconds. There are some jobs we had that literally took days to run that now complete in minutes. Redshift is amazingly simple to use, crazy fast and so very cost effective.

Generally we're dealing with datasets that are around 1-3TB, and pretty well organized. Its just amazing how forgiving Redshift is when it comes to unusually written SQL and how useful it is to us as a business. Extracting serious insights was once a job that only a few people could do, now its something that anyone with a SQL background can manage.

Redshift is an especially limited SQL engine considering it doesn't support UDFs. It is wicked fast, but what you get in speed you lose in flexibility. Current (well, February, but fairly current) benchmarks[0] place Impala and Shark (SQL on top of Spark) within grasp of Redshift while pulling data from disk and, for certain workloads, on par or faster than Redshift. This is without using a columnar file format.

Impala is impressive technology, but it does require you to run dedicated Impala daemons as it doesn't use map reduce under the hood. Shark is especially interesting, however, because it is fast AND build on top of spark, so you can run raw Spark jobs, SQL queries, graph processing and ML all on the same cluster. Shark currently uses Hive to generate it's query plans, but the Spark project is working on implementing it's own SQL engine called Catalyst[1] that promises to be a significant improvement.

[0] https://amplab.cs.berkeley.edu/benchmark/

[1] https://spark-summit.org/talk/armbrust-catalyst-a-query-opti...

Although I have a lot of respect for the amplab, they did not do their due diligence with that benchmark. Mainly for a few reasons, they didn't test using columnar storage in Hadoop (ORC / Parquet), which is what Redshift is using underneath (a proprietary columnar store). Also, the most complicated query they ran was a two table join, and from what I can tell, there wasn't any concurrent workload testing.

(disclaimer: I'm a Cloudera employee):

I recommend checking out the following blog, not because my employer wrote it, but because the guys behind the benchmark did an incredible job making the benchmark competitive. They also show metrics that a lot of the other people are not showing, for example concurrent workload capabilities, CPU efficiency, etc.

Impala, Hive (on Tez), Shark, Presto

http://blog.cloudera.com/blog/2014/05/new-sql-choices-in-the...

Impala does not currently support Serde last when I checked, which limits its usage for certain cases. And I would not treat any benchmark too seriously since every vendor probably would only know/be willing to tune its own products. Check the latest Spark SQL benchmark. http://databricks.com/blog/2014/06/02/exciting-performance-i...
You pay a significant resource penalty when using Serdes, and since performance is one of the biggest priorities to the Impala team, we decided to leave this out for now. A very common workaround is to use Hive to generate Parquet data from your custom data (using Serdes), and then use Impala for querying the Parquet data.

I disagree with your statement regarding not treating benchmarks from vendors seriously. As the article mentions, we made an effort to make these queries run as efficient as possible, even going so far as re-writing queries on competing engines to make them run faster. In fact, Databrick's engineers assisted us in making the Shark benchmarks as good as they could possibly get. The benchmark that I linked is very thorough, and even supplies the exact queries / scripts we used to perform the tests so you can do them yourself.

Much like the Clouderan commenter, I wouldn't put a lot of stock in Berkeley's Big Data Benchmark. I reran a similar test with columnar storage and found Impala handily beats Shark. Operationally it's also much easier to deploy (provided you're on EMR or CDH). The "dedicated nodes" argument is kind of FUD, you can use LLAMA for resource sharing, and you need to colocate imapalad with DataNodes to achieve decent performance anyways. So YARN, Spark and Impala can all play nice together on the same cluster.

The queries and data set Berkeley chooses are bizarre. TPC-DS or TPC-H are much more representative of real-world performance, and the differences are more pronounced as the queries get more complex.

edit: I also don't understand why the Spark team is reinventing the wheel for Spark SQL when Hive running on Tez will produce very similar query plans. The two projects are converging to the same place, but they insist on having divergent code bases ;)

Llama+Impala isn't quite ready for prime time in my experience. The biggest issue is the reliance on Impala's query size estimates to determine how many resources to request from Yarn. We find that these estimates are frequently an order of magnitude or so away from reality.
Agreed, and also LLAMA doesn't support high-availability at the moment (soon to be fixed). We rely heavily on up to date table/column statistics in order to accurately determine resource consumption, and unfortunately Impala doesn't currently have incremental/background stats, something that should be in the 2.0 release.
Actian Matrix (Paraccel) does support UDFs, but you'd have to run your own cluster on premise.
Completely agree with everything you said. I disagree about Impala, right now.

It has great potential, but I don't think it's prod-ready yet.

Also, why no mention of HBase?

Although Impala is still a fairly new product, my team has been using it internally at Cloudera in production for over a year for real-time log analysis to our support engineers (http://bit.ly/USFQdh), among other ad-hoc BI analytics. We also have a bunch of customers who are using Impala to power very critical interactive workloads. What about Impala makes you feel like it's not production ready?

Good question about HBase, I didn't mention HBase because although it's a super fast NoSQL database, it's a lousy analytical database. Sure it's great at doing really fast scans over small slices of data (and/or updating data), but full table scans are extremely slow when compared to analyzing flat files in HDFS. For doing analytics in Hadoop, the format you almost always want is Parquet. Not only is reading files directly from HDFS faster, but Parquet is a true columnar store, so you pay a minimal IO penalty for queries since you only read necessary data. Also, Parquet uses some really efficient column encoding formats like (dictionary, delta, run length, ..) to reduce both IO and to increase the effectiveness of compression.

I knew I smelled me some Cloudera... :)

HBase:

I think HBase (based on the sorting of qualifiers within rows) would be suited toward the "ranking" problem, that's why I brought it up. I see this as being a map-only job (and possibly suited toward streaming, or not even using Hadoop at all). It would just be a quick scan/filter/pagination and then a quick ranking algo in some sort of API middle layer (how I envision this).

Impala:

I started using Impala around the 1.2.(don't remember) version which was at the tail-end of CDH4. I found that minimal increments (for instance from 1.2.1 to 1.2.2), would change query behavior and results. We were also using Impala with it's HBase connectivity, which I found to be very poor and about 100x slower than Hive+HBase. If I wanted parallelism to my queries against HBase tables, I had run my queries between row keys for each region and use some sort of "union all", which would increase performance and parallelize the query. Honestly, I'd consider dropping HBase from Impala until it can be made more stable and consistent with what you might expect with SQL queries. Some of the results from Impala didn't make any sense with regards to Impala + HBase (it's just a storage engine for Impala, right?), like joins and null handling. If I were to create these tables as Parquet (or even MySQL) with the same data, and run the same queries, Parquet + MySQL would agree, but Impala+HBase would diverge.

I think that Impala really kicks ass for ADHOC and infrequently run queries, but if you have a lot of concurrent queries, I don't think it handles the load very well (compared with something like Vertica). Perhaps this could be improved upon? We'd love to replace Vertica, and it seems that the only other product in its class is Impala.

I tried to use Parquet, but Parquet is really only suitable for bulk loads (not trickle loading). I was impressed with Parquet's query speed, but I had hard requirements preventing me from doing bulk loads. Impala+Parquet does deliver real-time queries/results, but the data can't be put in there in real-time, so I think this deserves a little asterisk.

BTWs:

BTW #1, do you have any matrices/data for the newer HBase (0.96.1.1+) and table scans? I find that I can table scan pretty well with a POC I put together on EC2. I can scan ~ 3 bn records (about 500m rows) per hour on a 8 node (7 active) cluster with 30.5 gb RAM and 800 gb SSD (i2.xl) on EC2. The company I'm currently at may be taking up some serious HBase. After pre-splitting my regions and disabling region splitting, I was able to keep it very stable without doing batched mutations with concurrent read and write. Before I disabled splitting, I was having a split/compaction storm that kept downing HBase. I use snappy compression on all CFs and I use bloom filters on the row-level.

BTW #2, your Cloudera retargeting for ads for me is wasting your money. We're already under the belt of Cloudera-paying customers. Just an FYI. :)

BTW #3, if you put "kill -9"'s (this may just be CDH 4-specific) into the GC on certain Cloudera-infused services (like HBase region servers), it would be nice if we could turn it off. Sometimes I don't mind some GC, but a cascading of region servers getting a "kill -9" just causes a cascade of badness.

Please don't think I'm shitting on you. I love Cloudera. As far as the Hadoop ecosystem goes, Cloudera is my _only_ choice. I cringe when people say MapR (very pushy inside sales, pain to install) or HortonWorks (too young). I've been using Hadoop since 2007, if it matters.

Yes, the HBase scanners in Impala are not very fast, and we know that. This is an area that needs improvement to maximize parallelism, but as of right now there are a bunch of things on the Impala roadmap that takes priority (disk-based aggregations/joins, window functions, nested data, order by without limit) to name a few.

As for Parquet, that file format is not designed for streaming, but instead is like you mentioned, it's meant for converting large datasets that you plan on running analytics on. Queries against data in parquet is _fast_, like really fast...I've seen queries go from 200 seconds down to 5 seconds by just converting the dataset to Parquet from text.

Concurrency in Impala is actually pretty good, and has always been a design goal from the beginning. I wouldn't compare Impala to Vertica or other analytical databases just yet, there's still a lot of room for improvement, but concurrency in Impala is much better than the other SQL on Hadoop engines (Hive, Presto, etc), and we've demonstrated that on our latest rounds of benchmarks.

BTW #1 - As I mentioned, HBase support Impala is pretty minimal at the moment, but still works fine for ad-hoc queries over small key spaces.

BTW #2 - hehe! I'll let our marketing team know :P

BTW #3 - I'm not sure what you mean here, I'll ask around to see if someone knows.

Thanks for your kind words! Sorry for the late response, I just saw that I had a response :)