by ryanbrush 4860 days ago
I wish the post had gone into depth on _why_ Redshift was significantly faster, but I'm betting it uses in-memory joins (hence the size limitations it mentions), whereas Hive joins are just MapReduce jobs that keep only minimal subsets of the data in memory at any given point. The upshot is that the Hive/MapReduce strategy isn't limited by physical memory.

Of course, if your data set can fit in memory, then Redshift or a similar technology is probably a better choice than Hive. But it's important to remember that the performance gains here come as the result of a tradeoff.
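
To make the tradeoff concrete, here is a toy sketch in Python. It is not how Redshift or Hive actually implement joins (the post doesn't say, and the above is a guess); the function names `hash_join` and `merge_join` are made up for illustration. The first keeps one whole side of the join in memory; the second streams over inputs that are already sorted by key and only holds the rows for the current key, which is roughly the shape of a join done after a sort/shuffle step.

```python
from itertools import groupby
from operator import itemgetter


def hash_join(left, right):
    """In-memory join: the entire left side is loaded into a dict first."""
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    # Memory use grows with the size of `left`.
    for key, rvalue in right:
        for lvalue in table.get(key, []):
            yield key, lvalue, rvalue


def merge_join(left_sorted, right_sorted):
    """Streaming join over inputs ALREADY sorted by key.

    Only the left rows for the current key are held in memory.
    """
    left_groups = groupby(left_sorted, key=itemgetter(0))
    right_groups = groupby(right_sorted, key=itemgetter(0))
    done = (None, None)  # sentinel: rows is None when a side is exhausted
    lkey, lrows = next(left_groups, done)
    rkey, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            lkey, lrows = next(left_groups, done)
        elif lkey > rkey:
            rkey, rrows = next(right_groups, done)
        else:
            # Matching key: buffer just this key's left rows.
            lvalues = [value for _, value in lrows]
            for _, rvalue in rrows:
                for lvalue in lvalues:
                    yield lkey, lvalue, rvalue
            lkey, lrows = next(left_groups, done)
            rkey, rrows = next(right_groups, done)


left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(list(hash_join(left, right)))
print(list(merge_join(left, right)))
```

Both produce the same rows on this input. The difference is that `hash_join` needs the whole left side to fit in memory, while `merge_join` pays for a sort up front and then needs only one key's worth of rows at a time.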

1 comment

It was significantly faster because, as was mentioned above, the graph ignores the 17 HOURS it took for Redshift to import the data.

The comparison is a complete and utter joke.