Hacker News
by cwsteinbach 4862 days ago
Disclaimer: I'm a committer on the Apache Hive project.

A couple of points, in no particular order:

* EMR Hive is a closed-source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (released more than a year ago), so EMR users aren't benefiting from the performance improvements that appeared in the 0.9 and 0.10 releases.

* It is implied (though not explicitly stated) that the Hive queries were run against gzip-compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?

3 comments

> Why wasn't that used in this performance comparison?

Because then the stupid headline wouldn't be so sensationalist, would it?

// I have no dog in this fight, but hate twisted claims

Great feedback. I was also skeptical of using EMR Hive because it is so far behind on versions. Also, Redshift can do the analytics part very well, but I don't think it can do the exploration part that Hadoop/Hive are so good at (but maybe I'm wrong).

Wow. A year ago a solution architect promised me they would catch up to mainline to get a bunch of critical bugfixes.