Hacker News
by mmt 2899 days ago
> are well-understood. It's true that many people try to use Hadoop when they'd be better served with simpler solutions

I posit that these two assertions are contradictory.

My own understanding of the term "well understood" is that it is synonymous with "widely understood". If many people are still making the mistake of using Hadoop when those costs outweigh the benefits, it seems that understanding isn't quite wide enough.

That said, although I have a grasp of when the tradeoff is so lopsided as to be obvious, I don't know where to go (or where to point other people to go) for a better understanding of where the boundary is.

Where should we go to better understand those costs?


They're well understood by anyone who has used these technologies professionally. I probably should have been more precise with my language, though. It's really a grey area as to where the boundary is and it depends a lot on your specific application. But as a general rule of thumb, my feeling is that for tens to hundreds of GB, you should consider it. And for TBs or more, you almost certainly want to be doing something distributed. Hadoop isn't necessarily the best option then, but it's a powerful tool. I don't know if there's any resource out there that really goes deep into the tradeoffs involved though. There probably is, given how popular the subject is, but I'm not aware of one.

The problem with the article is that if it's for a general audience that doesn't understand the tradeoffs of a system like Hadoop, it really paints a picture that it is just a bad, slow tool. It barely acknowledges just how rigged the comparison is at all, aside from mentioning that you might need something like Hadoop for really big data in the conclusion, while it is peppered with unnecessarily snide comments about Hadoop that will probably be more memorable. I think it is liable to leave readers more confused about the tradeoffs involved after reading than before.

> They're well understood by anyone who has used these technologies professionally.

We need to substitute the word "professionally" with more precise terms when talking about our industry. Because if one were to read "professionally" as "at work", then your statement is absolutely false - both the lower bound and average amounts of critical thinking and caring in this industry are extremely low. Even ignoring people who obviously have no clue, I can still imagine Hadoop and other big data stacks being sanctioned by "professionals" in management for buzzword-generating reasons, and implemented by "professional" "engineers" for CV-padding reasons.

You're probably right. I'm thinking of specific coworkers when I think of the level of knowledge that should be expected, and generally the standard I would hold coworkers to would include understanding this. But it is probably not as widely understood as I would hope.

> They're well understood by anyone who has used these technologies professionally. I probably should have been more precise with my language, though

So.. It's well understood by True Scotsmen? :) I certainly understood that there are experts in the field who have a deep, even intuitive understanding of how and when to use which tools. To the extent that those experts don't communicate that knowledge to a broader audience while at the same time may be advocating for the use of the tools, they bear some responsibility for the misuse.

My point wasn't so much that you used imprecise language, but rather, that the statement about how well it's understood was inaccurate or irrelevant (depending on which definition you were going for).

> But as a general rule of thumb, my feeling is that for tens to hundreds of GB, you should consider it. And for TBs or more, you almost certainly want to be doing something distributed

While a 3-4TB cutoff makes some sense if one's workload has to remain in-memory for performance reasons, that can't be anywhere near the cutoff for any kind of workload that could stand to read from SSDs.

> I don't know if there's any resource out there that really goes deep into the tradeoffs involved though. There probably is, given how popular the subject is, but I'm not aware of one.

I would hope so, but I'm not so sure. It may not even need to be very deep, something akin to the "5 minute rule" for memory/disk caching. Mostly, I'm not convinced that the subject of tradeoffs is actually popular, so much as just using the tool without considering them is.
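
For reference, the break-even calculation behind the five-minute rule can be sketched in a few lines. This is only an illustration: the formula is the commonly cited form of the rule, the function and parameter names are mine, and every input value below is an assumption chosen to resemble late-1990s hardware, not a measurement.

```python
def break_even_seconds(pages_per_mb_ram: float,
                       accesses_per_sec_per_disk: float,
                       price_per_disk: float,
                       price_per_mb_ram: float) -> float:
    """Interval below which keeping a page in RAM is cheaper than re-reading it.

    A page that is re-referenced more often than this interval is worth
    caching in memory; otherwise it is cheaper to fetch it from disk.
    """
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * (
        price_per_disk / price_per_mb_ram
    )


# Illustrative inputs only (assumptions, not measurements):
# 8 KB pages -> 128 pages per MB, 64 random accesses/s per disk,
# $2000 per disk drive, $15 per MB of RAM.
interval = break_even_seconds(128, 64, 2000.0, 15.0)
print(f"break-even interval: {interval:.0f} s (~{interval / 60:.1f} min)")
```

Plugging in current prices for RAM and SSDs moves the interval, which is the appeal of having a rule rather than a fixed number.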

> The problem with the article is that if it's for a general audience that doesn't understand the tradeoffs of a system like Hadoop, it really paints a picture that it is just a bad, slow tool.

If we're talking about the adamdrake.com 235x article, I have to disagree, as my read of it was that it focused on evangelizing the "under-used approach for data processing" of "standard shell tools and commands".

> it is peppered with unnecessarily snide comments about Hadoop that will probably be more memorable

That's certainly not a charitable interpretation, and I would hazard that it's not even fair or factual (as to "peppered", at least). It's mentioned only a handful of times:

> Command-line Tools can be 235x Faster than your Hadoop Cluster

I agree that click-bait can be considered snide.

> I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR

To me, this comment, this very first mention of Hadoop in the intro, made it clear that this was "rigged" in that the "competition" was neither competing nor truly concerned about performance.

> while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec).

Merely a factual summary. Nothing snide that I could detect.

> Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data ™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

This seems like just a restatement of the admission in the introduction plus the assertion (that I believe even you agree with) that many people mis-use Hadoop when it's not called for.

> The resulting stream processing pipeline we will create will be over 235 times faster than the Hadoop implementation and use virtually no memory.

Again, just another factual summary, with no snideness I could detect.

> While we can certainly do better, assuming linear scaling this would have taken the Hadoop cluster approximately 52 minutes to process.

This next mention is after at least half of the bulk of the article. It may be an assuming-spherical-cows estimation, but it doesn't strike me as grossly misleading on its face, and there's no editorializing.

> This gets us up to approximately 77 times faster than the Hadoop implementation.

> about 174 times faster than the Hadoop implementation.

> gets us down to a runtime of about 12 seconds, or about 270MB/sec, which is around 235 times faster than the Hadoop implementation.

The next three mentions, near the end, are just comparisons of the evolving demonstration implementation to the reference implementation. No detectable snideness.

> Hopefully this has illustrated some points about using and abusing tools like Hadoop

> but more often than not these days I see Hadoop used

Here in the conclusion paragraph is where I agree that there is both snideness and where a reader may be confused about tradeoffs, if that's what they were expecting to be enlightened about.

However, because that's not what the introduction promised, my criticism would be merely that the conclusion doesn't match the introduction (and maybe goes too far into inflammatory territory with "abuses").

Pretend that section isn't even in the article, and the article still reads OK.
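
As a rough cross-check of the figures quoted above, the speedup can be computed two ways and compared. This is a minimal sketch that uses only numbers quoted from the article in this comment; the variable names are mine, and the linear-scaling estimate is the article's own assumption.

```python
# Figures as quoted from the article above.
hadoop_mb_per_s = 1.14        # "processing speed of about 1.14MB/sec"
shell_mb_per_s = 270.0        # "about 270MB/sec"
shell_seconds = 12.0          # "a runtime of about 12 seconds"
hadoop_scaled_minutes = 52.0  # "approximately 52 minutes", assuming linear scaling

# Speedup measured as a ratio of throughputs.
throughput_ratio = shell_mb_per_s / hadoop_mb_per_s

# Speedup measured as a ratio of wall-clock times on the full dataset.
wallclock_ratio = (hadoop_scaled_minutes * 60.0) / shell_seconds

print(f"throughput ratio: {throughput_ratio:.0f}x")
print(f"wall-clock ratio: {wallclock_ratio:.0f}x")
```

Both ratios land in the same neighborhood as the article's headline figure, with the gap presumably down to rounding in the quoted numbers.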

I agree with your first two points. It's definitely possible to efficiently process quite a bit of data on a single machine, although I think that past a terabyte, there begin to be strong arguments for a distributed approach even if you can theoretically handle it on one machine (scalability if requirements change, resilience to machine failures, etc.).

I still disagree about the article itself (even outside of the conclusion), but perhaps I am reading it uncharitably and other people are not getting the same impression. I do feel that it would be easy for someone who is not very familiar with these technologies to get the wrong impression. Misuse does probably go mainly in the other direction (of people overusing Hadoop rather than underusing), though, so maybe that is not so important a concern.

> I think that past a terabyte, there begin to be strong arguments for a distributed approach even if you can theoretically handle it on one machine

I've elaborated on these in another comment, as well.

Today, drawing the line at a single terabyte is way too early, even for all-in-memory workloads, if only because there exists an almost 4TB AWS instance now. Any smaller than 3.5TB (or whatever RAM is available to applications) is, at best, living in the past.

> scalability if requirements change

This reads as premature optimization, which turns the strong argument into either a weak argument or even an argument against.

Now, if you know or have reasonable certainty that your requirements will change (and will do so faster than, say, Moore's Law) and change soon, then that's different. I suspect there are people who think this, but that it's little more than wishful thinking or a delusion as to how large their slice of "web scale" actually is.

> resilience to machine failures

Machine failures just aren't a legitimate consideration for modern, high-end (but still commodity) hardware. You wouldn't bet your whole business on it, of course, but a 1% chance every year of losing an hour or two of batch processing? Sure.

Sadly, the flip side of this is that I see Hadoop clusters being built with such reliable servers, including redundant PSUs and fans, instead of taking full advantage of the resilience at the software level in order to save as much as possible at the hardware level. The original company behind map-reduce is certainly not splurging on hardware.
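
To put rough numbers on this, here is a small sketch of the arithmetic. It assumes failures are independent and uses the 1% annual rate and two-hour loss mentioned above; the machine counts are hypothetical, as are the function names.

```python
def expected_hours_lost(p_failure_per_year: float, hours_per_incident: float) -> float:
    """Expected batch-processing hours lost per year on a single machine."""
    return p_failure_per_year * hours_per_incident


def p_any_failure(p_failure_per_year: float, n_machines: int) -> float:
    """Chance that at least one of n machines fails in a year.

    Assumes independent failures, which is a simplification.
    """
    return 1.0 - (1.0 - p_failure_per_year) ** n_machines


p = 0.01  # the 1% annual failure chance from the comment above

print(f"single machine: {expected_hours_lost(p, 2.0):.2f} expected hours lost per year")

# Hypothetical cluster sizes, for comparison only.
for n in (1, 10, 100):
    print(f"{n:>3} machines: {p_any_failure(p, n):.1%} chance of at least one failure")
```

On a single machine the expected loss is tiny. Across many machines, some failure becomes likely, which is the case software-level resilience is meant to cover on cheap hardware.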

I'm not saying that past a terabyte is a point where you definitely want to use distributed processing, just that at that point, you should really strongly consider it. There is usually a lot of fuzziness around estimates you get about what sort of data volume you'll need to deal with, and it's not uncommon for it to vary by integral factors between days. If you're pushing the limits of what your system can handle without needing a dramatic rearchitecting, then that's a big risk, and it's not necessarily premature to build in the flexibility to have the option of scaling in the future if you need to. If you hit that 4TB and you still need more, it will be a big headache.

I can't really comment on rates of machine failures, but I have seen it happen before, even just for stupid reasons like someone in a data center unplugging a machine.

Fair enough, for in-memory only, if 1TB is your raw data, by the time it's indexed, it's going to be bigger.

Surely, though, workloads that require in-memory performance are fairly niche, and jumping from there to distributed (even in-memory) seems non-obvious, at best. Why aren't large arrays of fast SSDs a better alternative? The bandwidth is comparable, but the latency is terrible (still comparable to ethernet to a remote node, though?)

What about workloads that don't require fully-in-memory in the first place? If the cutoff is, then, hundreds of TB, wouldn't that cover the vast majority of common use cases?

> I can't really comment on rates of machine failures, but I have seen it happen before, even just for stupid reasons like someone in a data center unplugging a machine.

That sort of anecdata isn't very useful, because a human can cause any failure at any layer, including someone stopping a whole cluster, which I've seen happen before.

My point about it not being a legitimate concern is that what is now common practice with what is now common equipment means it's uncommon. These practices and equipment had to evolve, but that evolution happened on the order of over a decade ago.

Also, be wary of selection bias. It's very easy to remember the "fire drill" because of the one machine failure, and it makes a much more interesting story to tell that gets passed around and modified enough, eventually sounding like multiple stories and therefore multiple machines. The hundreds of servers that operated unheard and unseen for years, sometimes beyond their specs (e.g. with only one blower out of four still turning and only half-speed at that), get nary a thought, let alone mention.

> If many people are still making the mistake of using Hadoop when those costs outweigh the benefits, it seems that understanding isn't quite wide enough.

That's probably true but, cynically, I tend to think a lot of the time when people use Hadoop they're not doing it because Hadoop is the best solution: they're doing it because the solution allows them to use Hadoop.

I'd venture that's a little too cynical, as it violates the "never attribute to malice what can adequately be explained by ineptitude" rule :)

I'm not suggesting that something like resume-padding is never a motivation, just that it seems unlikely to be the sole or even primary motivation.

Now, I may be wrong, but my reasoning behind this is that, for all its purported power as a tool, as potatoyogurt has alluded [1], it's really only the true experts who can wield that power effectively. The ones who merely succeeded at getting it on their resume would only be able to use it where its power isn't truly needed (as was the case before).

If the technology falls out of fashion, then those resume-holders will be in a position of needing to deliver the cargo, rather than carved wooden headphones.

As such, I believe their motivation is actually to attempt to learn the true power of the tool (i.e. true resume building, rather than mere padding) and that they're grossly underestimating the costs through ignorance and a desire to learn by doing.

The trouble is, without well-published non-distributed reference implementations, and with the ultra-popularity of the distributed tools instead, they never end up learning those costs, and we're in a state of perpetuated ignorance and perpetuated over-use.

[1] by saying that the cost trade-offs are well understood by the experts, which strongly implies that the experts have a pretty deep understanding of the actual mechanics of distributed computing in general.

Many of the alternates (including linux cli stuff) that are much faster require a re-thinking of attitude, don't work where there are tens to hundreds of people submitting queries, or require different skills. It's tragic to think of all of the computrons and watts wasted with Hadoop-ish stuff (map-reducing without filters, Java itself for most implementations) - but still I wouldn't recommend to most CIOs they replace Hadoop in all or maybe even most cases, even for few-TB data sets and smaller.

Both because of familiarity with querying and the solidity of running a multi-tenant system.

But I do recommend that they switch to MapR [c++ core and a passable central FS for unix-based super fast queries] if they're concerned with efficiency.

[For context, in my day job we do multiple clusters of millions of network traffic summaries/sec and are often replacing Hadoop, or more recently, ELK, as people tried to use them for that use case. All well >>> will fit in ram. We have our own in-house column-store + streaming combo db done in go/c/c++ that started as clustering fastbit.]

> Many of the alternates (including linux cli stuff) that are much faster require a re-thinking of attitude, don't work where there are tens to hundreds of people submitting queries, or require different skills.

I doubt the point of the article was to suggest that linux cli stuff would scale to hundreds of users on the same host, but, if each of those users has a host of their very own, such as a laptop, the model could scale very well indeed, for small enough datasets.

> I wouldn't recommend to most CIOs they replace Hadoop in all or maybe even most cases, even for few-TB data sets and smaller.

Well, as you point out later, regarding familiarity, once it's in, it's probably too late. What about for a new implementation?

In answering this question, don't get too hung up on a literal interpretation of "single" server being exactly one. For example, a traditional RDBMS with one or more replicas (for performance, redundancy, or both) would still fall under the single server model. Really, it's about the non-distributed-computing option.

> if they're concerned with efficiency.

The fact that this is an "if" (and I do know that it is, even for startups) is bewildering to me, even more so in the context of distributed architectures where scaling is less linear the more data that has to be shared.

Absolutely a believer in the power of single machine analytics on GB or TB vs PB datasets. But I've found that whether it's a new data system or just showing csv/tsv tool magic, if people aren't into data systems (for the former) or unix cli (for the latter), it's a whole culture/training thing vs. most technically efficient wins.

So I think the latter issues dominate (culture/training) for new implementations as well as existing.

Re: efficiency overall, and of distributed systems, it's interesting to see both that MapR has made a business selling more efficient Hadoop-ishness, but also depressing how infrequently I see them deployed, even for pretty massive clusters.

Unfortunately, it's hard to see a startup focusing on single-machine solutions for GB->TB sets (even scaled out with replicas) as the way many in the space get started is with an open core model and the thing they charge for is the clustering and/or monitoring needed to become a distributed system.

But... I am optimistic we'll see a generational effect over the next 10 years in openness/interest in composable ad-hoc analytics tools, especially with Windows incorporating unix cli components.

> MapR has made a business selling more efficient Hadoop-ishness, but also depressing how infrequently I see them deployed, even for pretty massive clusters.

I'm only very slightly familiar with their features/value-add and not at all with their pricing. Could the pricing model be particularly unpalatable for some reason?

Not that I expect there has to be a deeper reason beyond simply not caring about cost/efficiency. I've certainly both seen and heard described plenty of Hadoop installations that seemed to have missed the "cheap" point in Google's M-R paper and subsequent Hadoop hardware selection advice from, for example, Hortonworks, or misunderstood what it meant. There may also be some misunderstanding of "commodity" or "industry standard" to mean server hardware of a certain "class" (such as brand name or with redundancy features), even if it conflicts with cheapness.

Some of it may be that the hardware selection advice articles (e.g. Hortonworks, Cloudera) are very old, with excellent general advice, but potentially misleading specific numbers. Even extrapolating from those numbers in a naive way can easily lead to needless expense and/or sub-optimal performance (that time some Xeons had 3, not 2, not 4, memory channels).

The latest article I found in an (admittedly quick) search was https://hadoopoopadoop.com/2015/09/22/hadoop-hardware/ from late 2015, which is still remarkably long ago and is rather verbose.

> My own understanding of the term "well understood" is that it is synonymous with "widely understood". If many people are still making the mistake of using Hadoop when those costs outweigh the benefits, it seems that understanding isn't quite wide enough.

It's entirely possible that 9 out of 10 people on a project using Hadoop know it's a waste but there are non-technical reasons for doing so. Resume padding and PHB demanding some technology of the week would be the two most common ones.

That said, it probably is contradictory much of the time. I'd say the majority of current developers don't know about the simpler tools.