| > They're well understood by anyone who has used these technologies professionally. I probably should have been more precise with my language, though

So... it's well understood by True Scotsmen? :)

I certainly understood that there are experts in the field who have a deep, even intuitive understanding of how and when to use which tools. To the extent that those experts don't communicate that knowledge to a broader audience, while perhaps advocating for the use of the tools at the same time, they bear some responsibility for the misuse. My point wasn't so much that you used imprecise language, but rather that the statement about how well it's understood was inaccurate or irrelevant (depending on which definition you were going for).

> But as a general rule of thumb, my feeling is that for tens to hundreds of GB, you should consider it. And for TBs or more, you almost certainly want to be doing something distributed

While a 3-4TB cutoff makes some sense if one's workload has to remain in memory for performance reasons, that can't be anywhere near the cutoff for any kind of workload that could stand to read from SSDs.

> I don't know if there's any resource out there that really goes deep into the tradeoffs involved though. There probably is, given how popular the subject is, but I'm not aware of one.

I would hope so, but I'm not so sure. It may not even need to be very deep: something akin to the "5 minute rule" for memory/disk caching would do. Mostly, I'm not convinced that the subject of tradeoffs is actually popular, so much as that using the tool without considering them is.

> The problem with the article is that if it's for a general audience that doesn't understand the tradeoffs of a system like Hadoop, it really paints a picture that it is just a bad, slow tool.

If we're talking about the adamdrak.com 235x article, I have to disagree, as my read of it was that it focused on evangelizing the "under-used approach for data processing" of "standard shell tools and commands".

> it is peppered with unnecessarily snide comments about Hadoop that will probably be more memorable

That's certainly not a charitable interpretation, and I would hazard that it's not even fair or factual (as to "peppered", at least). It's mentioned only a handful of times:

> Command-line Tools can be 235x Faster than your Hadoop Cluster

I agree that click-bait can be considered snide.

> I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR

To me, this comment, this very first mention of Hadoop in the intro, made it clear that this was "rigged" in that the "competition" was neither competing nor truly concerned about performance.

> while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec).

Merely a factual summary. Nothing snide that I could detect.

> Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

This seems like just a restatement of the admission in the introduction plus the assertion (that I believe even you agree with) that many people mis-use Hadoop when it's not called for.

> The resulting stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation and use virtually no memory.

Again, just another factual summary, with no snideness I could detect.

> While we can certainly do better, assuming linear scaling this would have taken the Hadoop cluster approximately 52 minutes to process.

This next mention is after at least half of the bulk of the article. It may be an assuming-spherical-cows estimation, but it doesn't strike me as grossly misleading on its face, and there's no editorializing.

> This gets us up to approximately 77 times faster than the Hadoop implementation.

> about 174 times faster than the Hadoop implementation.

> gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.

The next three mentions, near the end, are just comparisons of the evolving demonstration implementation to the reference implementation. No detectable snideness.

> Hopefully this has illustrated some points about using and abusing tools like Hadoop

> but more often than not these days I see Hadoop used

Here, in the conclusion paragraph, is where I agree that there is snideness and that a reader may be confused about tradeoffs, if that's what they were expecting to be enlightened about. However, because that's not what the introduction promised, my criticism would be merely that the conclusion doesn't match the introduction (and maybe goes too far into inflammatory territory with "abusing"). Pretend that section isn't even in the article, and the article still reads OK. |
I still disagree about the article itself (even outside of the conclusion), but perhaps I am reading it uncharitably and other people are not getting the same impression. I do feel that it would be easy for someone who is not very familiar with these technologies to get the wrong impression. Misuse probably does go mainly in the other direction, though (people overusing Hadoop rather than underusing it), so maybe that is not so important a concern.