| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cle 3457 days ago

Latency does improve if you download your data in parallel across a cluster. Or if you're running many iterations of an algorithm over GBs of data thousands of times, and each iteration is independent--you can save hours or days by performing them in parallel on a cluster.

If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.

At the end of the day it just comes down to use cases. There are a LOT of other use cases that "big data" platforms address other than being able to fit data in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.

> how is resilience helped with a big data solution?

The "R" in Spark's RDD abstraction is for "Resilient". Node failures and replication failures can be recovered without you even knowing it.

Sure, you can write all this stuff from scratch every time you encounter them (mirror data on hosts, run embarrassingly-parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.