Hacker News
by bartread 2899 days ago
> If many people are still making the mistake of using Hadoop when those costs outweigh the benefits, it seems that understanding isn't quite wide enough.

That's probably true but, cynically, I tend to think a lot of the time when people use Hadoop they're not doing it because Hadoop is the best solution: they're doing it because the solution allows them to use Hadoop.


I'd venture that's a little too cynical, as it violates the "never attribute to malice what can adequately be explained by ineptitude" rule :)

I'm not suggesting that something like resume-padding is never a motivation, just that it seems unlikely to be the sole or even primary motivation.

Now, I may be wrong, but my reasoning is that, for all its purported power as a tool, as potatoyogurt has alluded [1], only the true experts can wield that power effectively. Those who merely succeeded at getting it onto their resume would only be able to use it where its power isn't truly needed (as was the case before).

If the technology falls out of fashion, then those resume-holders will be in a position of needing to deliver the cargo, rather than carved wooden headphones.

As such, I believe their motivation is actually to attempt to learn the true power of the tool (i.e. true resume building, rather than mere padding) and that they're grossly underestimating the costs through ignorance and a desire to learn by doing.

The trouble is that, with no well-published non-distributed reference implementations and with the distributed tools ultra-popular instead, they never end up learning those costs, and we're left in a state of perpetuated ignorance and perpetuated over-use.

[1] by saying that the cost trade-offs are well understood by the experts, which strongly implies that the experts have a pretty deep understanding of the actual mechanics of distributed computing in general.

Many of the alternates (including linux cli stuff) that are much faster require a re-thinking of attitude, don't work where there are tens to hundreds of people submitting queries, or require different skills. It's tragic to think of all of the computrons and watts wasted with Hadoop-ish stuff (map-reducing without filters, Java itself for most implementations) - but still I wouldn't recommend to most CIOs they replace Hadoop in all or maybe even most cases, even for few-TB data sets and smaller.

That's both because of familiarity with querying and because of the solidity of running a multi-tenant system.

But I do recommend that they switch to MapR [c++ core and a passable central FS for unix-based super fast queries] if they're concerned with efficiency.

[For context, in my day job we run multiple clusters handling millions of network traffic summaries/sec and are often replacing Hadoop or, more recently, ELK, where people tried to use them for that use case. All well beyond what will fit in RAM. We have our own in-house column-store + streaming combo DB written in Go/C/C++ that started as clustering FastBit.]

> Many of the alternates (including linux cli stuff) that are much faster require a re-thinking of attitude, don't work where there are tens to hundreds of people submitting queries, or require different skills.

I doubt the point of the article was to suggest that linux cli stuff would scale to hundreds of users on the same host, but, if each of those users has a host of their very own, such as a laptop, the model could scale very well indeed, for small enough datasets.
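A minimal sketch of what the single-host model can look like, assuming a tab-separated file that streams comfortably from local disk. The function name `count_column` and the column layout are hypothetical and purely illustrative; this is roughly what `cut | sort | uniq -c` does, not anything taken from the article:

```python
import csv
from collections import Counter


def count_column(lines, column, delimiter="\t"):
    """Count how often each value appears in one column, reading row by row."""
    counts = Counter()
    # QUOTE_NONE treats quote characters as ordinary data, like `cut` would.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip short or blank rows instead of failing on them.
        if len(row) > column:
            counts[row[column]] += 1
    return counts
```

Called as `count_column(open("events.tsv", newline=""), 2).most_common(10)` (file name hypothetical), it reads one row at a time, so memory use grows with the number of distinct values rather than with the size of the file.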

> I wouldn't recommend to most CIOs they replace Hadoop in all or maybe even most cases, even for few-TB data sets and smaller.

Well, as you point out later, regarding familiarity, once it's in, it's probably too late. What about for a new implementation?

In answering this question, don't get too hung up on a literal interpretation of "single" server being exactly one. For example, a traditional RDBMS with one or more replicas (for performance, redundancy, or both) would still fall under the single server model. Really, it's about the non-distributed-computing option.

> if they're concerned with efficiency.

The fact that this is an "if" (and I do know that it is, even for startups) is bewildering to me, even more so in the context of distributed architectures where scaling is less linear the more data that has to be shared.

I'm absolutely a believer in the power of single-machine analytics on GB or TB (vs. PB) datasets. But I've found that, whether it's a new data system or just showing off csv/tsv tool magic, if people aren't into data systems (for the former) or the unix cli (for the latter), the outcome is decided by culture and training rather than by whichever option is most technically efficient.

So I think those issues (culture/training) dominate for new implementations as well as existing ones.

Re: efficiency overall, and of distributed systems, it's interesting to see both that MapR has made a business selling more efficient Hadoop-ishness, but also depressing how infrequently I see them deployed, even for pretty massive clusters.

Unfortunately, it's hard to see a startup focusing on single-machine solutions for GB->TB sets (even scaled out with replicas), as many in the space get started with an open core model, and the thing they charge for is the clustering and/or monitoring needed to become a distributed system.

But... I am optimistic we'll see a generational effect over the next 10 years in openness/interest in composable ad-hoc analytics tools, especially with Windows incorporating unix cli components.

> MapR has made a business selling more efficient Hadoop-ishness, but also depressing how infrequently I see them deployed, even for pretty massive clusters.

I'm only very slightly familiar with their features/value-add and not at all with their pricing. Could the pricing model be particularly unpalatable for some reason?

Not that I expect there has to be a deeper reason beyond simply not caring about cost/efficiency. I've certainly both seen and heard described plenty of Hadoop installations that seemed to have missed the "cheap" point in Google's M-R paper and subsequent Hadoop hardware selection advice from, for example, Hortonworks, or misunderstood what it meant. There may also be some misunderstanding of "commodity" or "industry standard" to mean server hardware of a certain "class" (such as brand name or with redundancy features), even if it conflicts with cheapness.

Some of it may be that the hardware selection advice articles (e.g. Hortonworks, Cloudera) are very old, with excellent general advice but potentially misleading specific numbers. Even extrapolating from those numbers in a naive way can easily lead to needless expense and/or sub-optimal performance (recall the time some Xeons had 3 memory channels, not 2 or 4).
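To make the memory-channel point concrete, here is a rough sketch of the arithmetic, assuming the usual guidance that every memory channel should be populated identically. The function names and the 8 GB DIMM size are illustrative assumptions, not figures from any vendor article:

```python
def balanced_dimm_counts(sockets, channels_per_socket, max_dimms_per_channel):
    """DIMM counts that put the same number of DIMMs on every channel."""
    channels = sockets * channels_per_socket
    return [channels * n for n in range(1, max_dimms_per_channel + 1)]


def balanced_capacities_gb(sockets, channels_per_socket,
                           max_dimms_per_channel, dimm_gb):
    """Total memory sizes reachable with identical DIMMs on every channel."""
    counts = balanced_dimm_counts(sockets, channels_per_socket,
                                  max_dimms_per_channel)
    return [count * dimm_gb for count in counts]
```

With 2 sockets, 3 channels each, and identical 8 GB DIMMs, `balanced_capacities_gb(2, 3, 3, 8)` gives 48, 96 and 144 GB, so a 64 GB figure copied from a guide written for 4-channel parts cannot be reached without populating the channels unevenly.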

The latest article I found in an (admittedly quick) search was https://hadoopoopadoop.com/2015/09/22/hadoop-hardware/ from late 2015, which is still remarkably long ago and is rather verbose.