|
|
|
|
|
by potatoyogurt
2896 days ago
|
|
Just replying to the top bit, since we're exchanging multiple comment chains. Hadoop, Spark, etc. are built on top of a job scheduler, so if you have multiple competing jobs, it will allocate tasks to resources in a fairly reasonable way. If there is too much contention at some point in time, your jobs will all still run, and portions that can't be scheduled immediately will be delayed while they wait for free nodes. The processing model also generally leads to independence between jobs. You can do job scheduling on a single machine too, of course, but you get it for free with Hadoop/YARN. You don't get it for free with chained bash scripts or a custom single-node solution. And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs. edit: if you're running on a cloud, you also may be able to autoscale to deal with spiky usage patterns. |
|
I'd venture to say that "for free" is a bit of a stretch, other than under the "already paid for" definition of "free". Being built on top of a job scheduler means one always has to exert the effort/overhead of dealing with a scheduler, even for the simplest case.
On a single machine, you can decide if you want to add the complexity of something like LSF or not. (I would call a custom single-node solution a strawman in the face of pre-Hadoop schedulers. Batch processing isn't exactly new).
> And without sophisticated job scheduling, you can't really come too close to fully allocating your machine (when you're not watching carefully), because you'll accidentally exhaust resources and break random jobs.
I'm pretty sure I'm still missing something. Since my goal is to gain a better understanding, I'll put forward where I believe the disconnect is (but definitely, anyone, correct me if my assumptions are off):
You're coming from a distributed model, where the assumption is that "resources" are divisible and allocatable in such a way that, for any given job, there's an effective maximum number of nodes it can use (e.g. network-i/o-bound job that would waste any additional CPUs). Any leftover nodes need other jobs allocated to them to reach full utilization.
I, however, am considering that the single-server model assumes that there won't be any obvious bottlenecks or limiting resources, instead prefering to run a single job at one time, as fast as possible. This would completely obviate the need for any complex scheduling and avoids accidental resource exhaustion (which, I suppose, is really only disk or memory).