by cle 3410 days ago
I think this is only half the story.

There are requirements other than mere size that can necessitate "big data" solutions, e.g. timeliness, resiliency, maintainability...

If you are building production data processing systems that have constraints on data size, latency, resiliency, scheduling, dependency management, etc., you might be better off with a "big data" system, even if the data could all fit on a beefy box. This was a painful lesson for me to learn.


Hm. Size is mere size. Latency will never improve with a "big data" solution over one machine with in-RAM data. Dependency management? You're going to declare it once and impose it everywhere anyway. Scheduling? Again, one machine with in-RAM data will always win.

That leaves resiliency and "etc." I can't answer the "etc.", but how is resilience helped with a big data solution? That seems like Lampson's distributed system: more machines, but you need k-of-n, k>1. Better to just mirror to two machines with the data in RAM.
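To put rough numbers on the k-of-n argument, here is a minimal sketch. It assumes machines fail independently and that each is up with the same probability `p`; the value 0.99 and the cluster sizes are made up purely for illustration, and real failures are often correlated.

```python
from math import comb

def availability(n: int, k: int, p: float) -> float:
    """Probability that at least k of n independent machines are up,
    when each machine is up with probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p = 0.99  # assumed per-machine availability, not a measured figure

single = availability(1, 1, p)             # one box
mirrored = availability(2, 1, p)           # two mirrors, either one suffices
cluster_all = availability(10, 10, p)      # 10 nodes, all must be up
cluster_8_of_10 = availability(10, 8, p)   # 10 nodes, any 8 must be up

for name, value in [
    ("single", single),
    ("mirrored (1-of-2)", mirrored),
    ("cluster (10-of-10)", cluster_all),
    ("cluster (8-of-10)", cluster_8_of_10),
]:
    print(f"{name:20s} {value:.6f}")
```

Under these assumptions, a cluster that needs every node is less available than a single box, and how a k-of-n cluster compares to a simple mirror depends entirely on how much slack k leaves.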

Latency does improve if you download your data in parallel across a cluster. Or if you're running thousands of independent iterations of an algorithm over GBs of data, you can save hours or days by running them in parallel on a cluster.
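A minimal sketch of the independent-work case, using a local thread pool as a stand-in for a cluster. `fetch_chunk` is a hypothetical placeholder that sleeps instead of doing a real download, and the chunk count and delay are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(chunk_id: int) -> int:
    """Hypothetical stand-in for downloading one chunk: sleeps instead of real I/O."""
    time.sleep(0.05)
    return chunk_id * chunk_id

def run_sequential(chunk_ids):
    # One chunk at a time: total latency is roughly the sum of the delays.
    return [fetch_chunk(c) for c in chunk_ids]

def run_parallel(chunk_ids, workers=8):
    # Chunks are independent, so they can overlap; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_chunk, chunk_ids))

chunk_ids = list(range(8))

start = time.perf_counter()
seq = run_sequential(chunk_ids)
seq_seconds = time.perf_counter() - start

start = time.perf_counter()
par = run_parallel(chunk_ids)
par_seconds = time.perf_counter() - start

print(f"sequential: {seq_seconds:.2f}s, parallel: {par_seconds:.2f}s")
```

The same shape applies when the workers are separate hosts rather than threads; the gain only holds while the pieces of work really are independent.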

If your scheduling involves running jobs that must wait on dependencies or events for a long time (hours, days), a hardware failure or some other anomaly can be catastrophic, whereas a "big data" framework can recover without your even knowing about it.

At the end of the day it just comes down to use cases. There are a LOT of use cases that "big data" platforms address beyond data that won't fit in RAM. Sometimes flying by the seat of your pants on one host doesn't cut it for business-critical processing.

> how is resilience helped with a big data solution?

The "R" in Spark's RDD (Resilient Distributed Dataset) abstraction stands for "Resilient". Node failures and replication failures can be recovered from without your even knowing about it.
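Spark gets this resilience by remembering how each partition was derived (its lineage) and recomputing lost partitions on demand. The toy below only illustrates that idea; `ToyRDD` is a hypothetical class made up for this sketch, not Spark's API or implementation.

```python
from typing import Callable, Dict, List

class ToyRDD:
    """Toy partitioned dataset that keeps a recipe (lineage) for each partition."""

    def __init__(self, num_partitions: int, compute: Callable[[int], List[int]]):
        self.num_partitions = num_partitions
        self._compute = compute                  # how to rebuild partition i
        self._cache: Dict[int, List[int]] = {}   # materialized partitions

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        parent = self
        # The child records only the recipe: apply fn to the parent's partition.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i: int) -> List[int]:
        if i not in self._cache:                 # missing or lost: recompute
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        """Simulate a node failure that wipes one materialized partition."""
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

source = [[1, 2], [3, 4], [5, 6]]
base = ToyRDD(len(source), lambda i: list(source[i]))
squared = base.map(lambda x: x * x)

before = squared.collect()
squared.lose_partition(1)   # "node failure"
after = squared.collect()   # partition 1 is rebuilt from its lineage
print(before == after)
```

The caller never handles the failure explicitly, which is the "without your even knowing about it" part.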

Sure, you can write all this stuff from scratch every time you run into these problems (mirror data on hosts, run embarrassingly parallel algorithms across a fleet of hosts, write your own DB-backed scheduling system, etc.), but all of these are solved problems in these big data frameworks. You'll be wasting tons of time reinventing the wheel. I've been there.

Is it even possible to get 10TB of RAM on a single commodity server?
The Supermicro SuperServer 8048B-TR4FT is listed as supporting up to 12TB of DDR4 ECC RAM (and could take 4x E7-8890 v4 for 96 cores / 192 threads). That's close to a commodity server, but probably doesn't quite count. Taking a wild guess at the price: $250k-$350k?

The SuperServer 7088B-TR4FT is listed as supporting 24TB of DDR4 ECC RAM (with 8x E7-8890 v4 for 192 cores / 384 threads).