Hacker News
by Loic 3410 days ago
The real question is: what do you want as a result? Suppose you search your 8TB database of molecules for the 1000 molecules most similar to a given one. You have 16 cores, so you cut the 8TB into 500GB chunks, continuously preload 1GB of molecules per core, accumulate 16*1000 candidate molecules, and merge at the end. You can do it on a single system, and you are working with a TB-size dataset.

This means the size of the dataset is not the only factor: you also need to take into account the operations performed on each "element/document", the size of the intermediate datasets, the size of the final results, and some more stuff (encoding, etc.).
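
The chunk-and-merge search described above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the comment: molecules are represented as sets of fingerprint bits, similarity is the Tanimoto (Jaccard) coefficient, and the toy data and function names are hypothetical. Chunks are scanned one after another here; each scan is independent, so it could be handed to a separate core.

```python
import heapq
from itertools import chain


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_in_chunk(chunk, query, k):
    """Return the best k (score, molecule_id) pairs within one chunk."""
    scored = ((tanimoto(query, fp), mol_id) for mol_id, fp in chunk)
    return heapq.nlargest(k, scored)


def top_k_overall(chunks, query, k):
    """Scan each chunk independently, then merge the per-chunk winners."""
    # Each partial result holds at most k entries, so the intermediate
    # data is (number of chunks) * k items, however large the chunks are.
    partials = (top_k_in_chunk(chunk, query, k) for chunk in chunks)
    return heapq.nlargest(k, chain.from_iterable(partials))


# Hypothetical toy data: two "chunks" of (molecule_id, fingerprint) pairs.
chunks = [
    [("m1", {1, 2, 3}), ("m2", {1, 2, 9})],
    [("m3", {1, 2, 3, 4}), ("m4", {7, 8})],
]
query = {1, 2, 3}
best = top_k_overall(chunks, query, k=2)
print(best)
```

The point of the sketch is that only the per-chunk top-k lists have to be kept in memory, which is why the size of the intermediate results matters as much as the size of the raw dataset.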


I see your point -- size alone doesn't matter, it's how you use it :-).

But how do things change when the dataset grows to 9TB? Now we need more than one HD. Hadoop + Spark is built for this exact use case...

I think the conundrum comes from the "donut hole" of medium-sized data. For 1TB, use a script and a laptop; for 100TB, use Spark running on dozens to hundreds of machines.

The problem is exactly that 8-9TB range, because running Spark on just two or three machines will be slower than running on a laptop with an extra external drive. You need to scale up to potentially dozens of machines just to get the same performance you were getting on the laptop. You were OK with a laptop; add more data and now you have a not insignificant AWS bill, unless you are OK with puttering around on a few machines much more slowly than on the laptop.

There is no middle-ground solution, so everyone starts with an overkill solution that scales, out of fear of getting stuck on one machine when the dataset grows. But most of these systems never grow enough to need to scale this way. So we are wasting resources running toy clusters on problems that would fit on a laptop.

Maybe I am becoming a cranky old man who yells at clouds, but I miss MPI. It has no frills, but it runs with next to no overhead and scales up to supercomputers with no donut hole in between.

If distributed software running on multiple expensive high-end servers cannot beat another solution running on a single laptop with a cheap external hard drive, the issue is not distributed systems; the issue is that that specific software is crap.
There will always be some overhead, but yes, it seems like some of these frameworks are pretty bloated.
Processing the data wouldn't be a problem with 2-socket Xeons, and neither would putting 3 or 5 HDDs in a RAID 5. Getting the 32TB in, however, would take at least 8 hours at a saturated 10Gbps, if your disks can write that fast.
10gigE seems fast, but in reality it's only 1.25GB/s in the ideal case. One enterprise PCIe SSD will saturate that. So will 5 of the old-style 3.5 inch 7.2k RPM drives (you can fit 12 of these in a dense 1U case).

That's why you see 40gigE or 56gigE used in HPC.
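
The line-rate arithmetic in the comments above is easy to check. The sketch below assumes an ideal link with no protocol overhead, and works out 32TB both as decimal terabytes and as binary tebibytes, since the comments don't say which is meant. The function and variable names are made up for illustration.

```python
def transfer_hours(size_bytes, link_bits_per_s):
    """Hours needed to move size_bytes over an ideal, fully saturated link."""
    return size_bytes * 8 / link_bits_per_s / 3600


TEN_GBE = 10e9  # 10 Gbit/s, ignoring all protocol overhead

# 10 Gbit/s expressed in GB/s
gb_per_s = TEN_GBE / 8 / 1e9

# 32 TB (decimal) and 32 TiB (binary) at line rate
hours_tb = transfer_hours(32e12, TEN_GBE)
hours_tib = transfer_hours(32 * 2**40, TEN_GBE)

# Sequential throughput each of 5 drives must sustain to fill the link
per_drive_mb_s = gb_per_s * 1000 / 5

print(f"{gb_per_s} GB/s")
print(f"{hours_tb:.2f} h for 32 TB, {hours_tib:.2f} h for 32 TiB")
print(f"{per_drive_mb_s} MB/s per drive")
```

Under these assumptions the ideal figure comes out at roughly 7.1 to 7.8 hours, so "at least 8 hours" is a fair estimate once real-world overhead is included.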

It's 25G, 50G or 100G today.