by brians
3409 days ago
Hm. Size is mere size. Latency will never improve with a "big data" solution over one machine with in-RAM data. Dependency management? You're going to declare it once and impose it everywhere anyway. Scheduling? Again, one machine with in-RAM data will always win. That leaves resiliency and "etc." I can't answer "etc.", but how is resilience helped with a big data solution? That seems like Lampson's distributed system: more machines, but you need k-of-n, k>1. Better to just mirror to two machines with the data in RAM.
|
If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.
At the end of the day it just comes down to use cases. There are a LOT of other use cases that "big data" platforms address other than being able to fit data in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.
> how is resilience helped with a big data solution?
The "R" in Spark's RDD (Resilient Distributed Dataset) abstraction stands for "Resilient". Node failures and replication failures can be recovered from without you even knowing it.
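To make that concrete, here is a toy sketch of the general idea behind that kind of resilience: remember how each partition was derived, and recompute only the lost piece from the durable input. This is not Spark's API. `ToyLineageDataset`, `lose_partition`, and the in-memory "source" are all hypothetical names made up for illustration, and the sketch assumes the original input can always be re-read.

```python
from typing import Callable, Dict, List, Optional


class ToyLineageDataset:
    """Toy partitioned dataset that remembers how its partitions were derived.

    Hypothetical illustration only; this is not how Spark is implemented.
    """

    def __init__(
        self,
        source_partitions: List[List[int]],
        transforms: Optional[List[Callable[[int], int]]] = None,
    ) -> None:
        # Durable input, assumed to be re-readable after a failure.
        self._source = source_partitions
        # The "lineage": ordered list of per-element transformations.
        self._transforms = list(transforms or [])
        # Computed partitions held in memory; entries can be lost.
        self._cache: Dict[int, List[int]] = {}

    def map(self, fn: Callable[[int], int]) -> "ToyLineageDataset":
        # Record the transformation instead of applying it eagerly.
        return ToyLineageDataset(self._source, self._transforms + [fn])

    def _compute(self, idx: int) -> List[int]:
        # Replay the recorded lineage over one source partition.
        rows = self._source[idx]
        for fn in self._transforms:
            rows = [fn(x) for x in rows]
        return rows

    def partition(self, idx: int) -> List[int]:
        # A missing partition is rebuilt from lineage; the caller never notices.
        if idx not in self._cache:
            self._cache[idx] = self._compute(idx)
        return self._cache[idx]

    def lose_partition(self, idx: int) -> None:
        # Simulate a node failure that drops one computed partition.
        self._cache.pop(idx, None)

    def collect(self) -> List[int]:
        out: List[int] = []
        for i in range(len(self._source)):
            out.extend(self.partition(i))
        return out


ds = ToyLineageDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = ds.collect()
ds.lose_partition(1)   # pretend the host holding partition 1 died
after = ds.collect()   # only partition 1 is recomputed
```

The caller's code is identical before and after the simulated failure, which is the "without you even knowing it" part. Doing this by hand on mirrored hosts means writing and testing that recovery path yourself.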
Sure, you can write all this stuff from scratch every time you run into these problems (mirror data on hosts, run embarrassingly-parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all of these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.