by cwsteinbach
4862 days ago
Disclaimer: I'm a committer on the Apache Hive project. A couple of points, in no particular order:

* EMR Hive is a closed-source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefiting from the performance improvements that appeared in the 0.9 and 0.10 releases.

* It is implied (though not explicitly stated) that the Hive queries were run against gzip-compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?
Because then the stupid headline wouldn't be so sensationalist, would it?
// I have no dog in this fight, but hate twisted claims