| > You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution. I'd venture to say that "for free" is a bit of a stretch, other than under the "already paid for" definition of "free". Being built on top of a job scheduler means one always has to exert the effort/overhead of dealing with a scheduler, even for the simplest case. On a single machine, you can decide if you want to add the complexity of something like LSF or not. (I would call a custom single-node solution a strawman in the face of pre-Hadoop schedulers. Batch processing isn't exactly new). > And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs. I'm pretty sure I'm still missing something. Since my goal is to gain a better understanding, I'll put forward where I believe the disconnect is (but definitely, anyone, correct me if my assumptions are off): You're coming from a distributed model, where the assumption is that "resources" are divisible and allocatable in such a way that, for any given job, there's an effective maximum number of nodes it can use (e.g. network-i/o-bound job that would waste any additional CPUs). Any leftover nodes need other jobs allocated to them to reach full utilization. I, however, am considering that the single-server model assumes that there won't be any obvious bottlenecks or limiting resources, instead prefering to run a single job at one time, as fast as possible. This would completely obviate the need for any complex scheduling and avoids accidental resource exhaustion (which, I suppose, is really only disk or memory). |
I don't think it's an obvious decision, and you're absolutely right that people are usually too quick to jump to distributed systems when they really don't need them. But I think there are a lot of arguments for using something like Hadoop even before it's strictly necessary. I think part of the disconnect is that we have different backgrounds, so we both look at different things and think "oh, that's easy" vs "oh, thats like a pain."