| > calling external APIs probably isn't something you want to be doing in Hadoop anyway And yet I've seen it done. At least they weren't truly external, just external to Hadoop. > In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them There's those goggles again :) Batch schedulers have existed for decades (e.g. PBS started in '91). > (such as smooth scaling without having to think about hardware), This neatly embodies what I believe is the primary fallacy in most of the decision making, including the fact that it's often parenthetical (i.e. a throwaway assumption). Does anyone really value the "smoothness" of scaling? I'd expect the important feature to be, instead, that the slope of the curve doesn't decrease too fast. The notion that Hadoop somehow frees one from having to think about hardware flies in the face of hardware sizing advice from Cloudera, Hortonworks, and others, which discusses sizing node resources based on workload expectations (mostly I/O and RAM) and heterogeneous clusters. It does, however, explain my observation, in the wild, of clusters built out of nodes that seem undersized in all respects for any workload. >It's now really easy to spin up a cluster, It's really easy if it's already there? That borders on the tautological. Or are you talking about an additional cluster in an organization where there already is one? >and with small scale, costs are not that big a deal. That's just too broad a generalization, just as "cost is a big deal" would be. Cost is always a factor, just not always the biggest one. Small scale is often (though not always) associated with limited funds. Doubling, or even tacking on 50% to, the cost could be catastrophic to a seed startup or a scientific researcher. >With large scale, your system is distributed in some way anyway. 
This strikes me as little more than a version of the slippery slope fallacy. Even some web or app servers behind a load balancer could be considered distributed [1], but that doesn't make them a good candidate for anything that's actually considered a distributed framework. It also hand-waves away the problem that, even if costs weren't a big deal at small scale, they don't somehow magically become less of an issue at large scale. Paying a 50% "tax" on $40k is one thing. At $400k, you could have hired someone to think about hardware for a year, instead. [1] I just recently pointed out, slightly tongue-in-cheek, an architectural similarity, one layer down, between app server and database https://news.ycombinator.com/item?id=17521817 |