Hacker News
by quantumhobbit 3410 days ago
I think the conundrum comes from the "donut hole" of medium-sized data. For 1TB, use a script and a laptop; for 100TB, use Spark running on dozens to hundreds of machines.

The problem is exactly that 8-9TB range, because running Spark on just two or three machines will be slower than a laptop with an extra external drive. You need to scale up to potentially dozens of machines just to get the same performance you were getting on the laptop. You were OK with a laptop; add more data and now you have a not insignificant AWS bill, unless you are OK puttering around on a few machines much more slowly than on the laptop.

There is no middle-ground solution, so everyone starts with an overkill solution that scales, out of fear of getting stuck on one machine when the dataset grows. But most of these systems never grow enough to need to scale this way. So we are wasting resources running toy clusters on problems that would fit on a laptop.

Maybe I am becoming a cranky old man who yells at clouds, but I miss MPI. It has no frills, but it runs with next to no overhead and scales up to supercomputers with no donut hole in between.

2 comments

If distributed software running on multiple high-end, expensive servers cannot beat another solution running on a single laptop with a cheap external hard drive, the issue is not distributed systems; the issue is that that specific software is crap.
There will always be some overhead, but yes, it seems like some of these frameworks are pretty bloated.
Processing the data wouldn't be the problem with two-socket Xeons, and neither would putting 3 or 5 HDDs in a RAID 5. Getting the 32TB in, however, would take at least 8 hours at a saturated 10Gbps, if your disks can write that fast.
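For anyone who wants to check the transfer-time estimate, here is a rough back-of-the-envelope sketch. It assumes decimal units (1TB = 10^12 bytes) and treats the 90% efficiency figure as a made-up illustration of protocol overhead, not a measurement; `transfer_hours` is just a hypothetical helper name.

```python
# Back-of-the-envelope transfer time over a network link.
# Assumes decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second.

def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link."""
    bytes_total = data_tb * 1e12
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_second / 3600

# Ideal case: the link is fully saturated with payload.
ideal = transfer_hours(32, 10)

# 0.9 is an assumed efficiency for illustration only, not a measured value.
derated = transfer_hours(32, 10, efficiency=0.9)

print(f"ideal: {ideal:.1f} h, at 90% efficiency: {derated:.1f} h")
```

The ideal case comes out a little over 7 hours, and any realistic overhead pushes it toward the 8-hour figure, before even asking whether the disks can keep up.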
10GigE seems fast, but in reality it's only 1.25GB/s in the ideal case. One enterprise PCIe SSD will saturate that, as will five old-style 3.5-inch 7.2k RPM drives (you can fit 12 of these in a dense 1U case).

That's why you see 40GigE or 56GigE used in HPC.

It's 25G, 50G or 100G today.
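To put the link speeds mentioned in this thread side by side, here is a small sketch that converts nominal line rate to GB/s and counts how many spinning drives it would take to saturate each link. The 0.25GB/s per-drive figure is an assumption backed out of the "five drives saturate 10GigE" claim above; real drives vary, and the nominal rates ignore encoding and protocol overhead.

```python
import math

# Assumed sequential throughput of one 7.2k RPM drive, in GB/s.
# Derived from the thread's claim that five such drives fill 1.25 GB/s;
# this is an illustration, not a benchmark.
HDD_GB_PER_S = 0.25

def link_gb_per_s(gbps: float) -> float:
    """Nominal link rate in gigabytes per second (8 bits per byte)."""
    return gbps / 8

def drives_to_saturate(gbps: float, drive_gb_per_s: float = HDD_GB_PER_S) -> int:
    """Smallest number of drives whose combined throughput fills the link."""
    return math.ceil(link_gb_per_s(gbps) / drive_gb_per_s)

# Link speeds named in the comments above.
for gbps in (10, 25, 40, 50, 56, 100):
    print(f"{gbps:>3}G: {link_gb_per_s(gbps):6.3f} GB/s, "
          f"{drives_to_saturate(gbps)} drives to saturate")
```

Under these assumptions, a 12-drive 1U box out-runs 10G comfortably but falls just short of filling a 25G link.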