|
in other words, "I used a chainsaw to cut an apple and it SUCKED at it." If you're processing an amount of data that comfortably fits in memory on a single machine, then obviously Hadoop is going to perform poorly in comparison. The costs of scheduling a job onto N mappers/reducers, transferring code to each node, waiting for the slowest mapper/reducer to finish, transferring data from mappers -> reducers, replicating output on HDFS, etc. are well-understood. It's true that many people try to use Hadoop when they'd be better served with simpler solutions, but that does not justify the amount of shade that the author throws at it. |
I posit that these two assertions are contradictory.
My own understanding of the term "well understood" is that it is synonymous with "widely understood". If many people are still making the mistake of using Hadoop when those costs outweigh the benefits, it seems that understanding isn't quite wide enough.
That said, although I have a grasp of when the tradeoff is so lopsided as to be obvious, I don't know where to go (or where to point other people) for a better understanding of where the boundary is.
Where should we go to better understand those costs?