|
|
|
|
|
by potatoyogurt
2894 days ago
|
|
hah, you're right, I definitely am looking at it with distributed system goggles on. With just one machine, yes, it will probably make sense to fully process one job before moving onto another (although there are some cases where this may not be true, such as if you're limited by having to wait for external APIs. But this isn't a strong argument because calling external APIs probably isn't something you want to be doing in Hadoop anyway).
If your workload scales, though, you will eventually have to spread out onto multiple machines anyway. At that point, even if you process each job on a single machine, you either need a system to coordinate job allocation across machines, or you have to accept some slack if you only schedule jobs on individual machines. In the former case, we're taking rapid steps towards YARN anyways. I'm sure non-Hadoop, pre-packaged systems exist that can handle this, although I'm not personally familiar with them, but my point is just that the more steps we take towards the complexity of a distributed computing model, the more it starts to seem reasonable to just say "okay, let's just use hadoop/spark/etc. and get the extra benefits they offer as well (such as smooth scaling without having to think about hardware), even if it's more expensive than a setup with individual nodes optimized for our purpose." It's now really easy to spin up a cluster, and with small scale, costs are not that big a deal. With large scale, your system is distributed in some way anyway. I don't think it's an obvious decision, and you're absolutely right that people are usually too quick to jump to distributed systems when they really don't need them. But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary. I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, thats like a pain." |
|
> I don't think it's an obvious decision
Certainly, which is why I even bother with discussion like this, in the hopes of making the decision clearer (of not obvious) to me in the future.
A response to my footnoted (in my other comment) comment pointed out how oversimplified my understanding of distributed databases was. Well, I knew it was an oversimplification, but not in which way.
There's plenty of computer science research from the 70s and 80s covering these topics, but they're both tough to translate to practical considerations, and they're woefully out of date (e.g. don't account for SSDs or cheap commodity hardware).
> But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary.
Well, philisophically, I would disagree with such an assertion on the grounds of premature optimization, absent the "strictly".
I would advocate for switching from scaling "up" (aka "vertically", larger single machines) to scaling "out" (aka "horizontally" or distributed) around the point of cost parity, not at the point it is no longer possible to scale up a single machine (unless that point can reasonably be expected to occur first, I suppose).
> I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, thats like a pain."
That would account for any overestimation of how difficult it is to work with hardware or how complex Hadoop is set up, administer, or use. Those are just initial conditions and may well unduly influence decision making that has far longer-lasting consequences.
However, I'd like to think I'm not often guilty of the latter overestimation when discussing solutions (and even advocating single-server), as I tend to assume that it can it least become easy enough for anyone out there, so long as the technology is popular enough (like Hadoop) or traditional/mature enough (like the tools in the original comment, or PBS) that plenty of documentation and/or experts exist.
My background also includes having seen, first-hand, over decades, various attempts at distributed processing and databases in practice, with varying degrees of success. This has included early "universal" filesystems like AFS, "sharding" MySQL to give it "web scale" performance [1], Glustre and its ilk, some NoSQLs, and of course Hadoop.
If anything, I'd say that, with most popular, new technologies, especially ones predicated on performance or scale, "it's a pain" is not the knee-jerk skeptical reaction my experience has ingrained in me. Rather, it's more like "sure, it's easy now, but you'll pay." TANSTAAFL.
[1] This worked well enough but did have a high up-front engineering cost and a high on-going operating cost for the large number of medium-small servers plus larger than otherwise needed app servers to do DB operations that could no longer be done inside the database because each one had incomplete data. Due to effort overestimation, it was unthinkable to move from a VPS to a colo so as to get a medium-large single DB server with enough attached storage to break the "web scale" I/O bottleneck for years to come.