HN Mirror


	by dajohnson89 3457 days ago
	I'm not sure why you got downvoted. It's a valid point and it makes intuitive sense. Is there a clear-cut answer as to whether one should choose a distributed solution or not? It seems to me that if you're at the terabyte scale, choosing a non-distributed solution is asking for trouble. A quick search indicates the largest HDD you can buy is around 8TB.

Loic 3457 days ago

The question is more: what do you want as a result? Suppose you search your 8TB database of molecules for the 1000 molecules most similar to a given one. You have 16 cores, so you cut the 8TB into 500GB chunks, continuously preload 1GB of molecules per core, accumulate 16*1000 candidate molecules, and merge at the end. You can do it on a single system while working with a TB-size dataset.

It means that the size of the dataset is not the only factor: you need to take into account the operations performed on each "element/document", the size of the intermediate datasets, the size of the final results, and some more stuff (encoding, etc.).
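
The chunk-and-merge idea above can be sketched in a few lines. This is only a minimal illustration, not anyone's production code: the `similarity` function is a toy stand-in for a real molecular similarity score, the "molecules" are plain numbers, and the names `top_k_in_chunk`, `merge_top_k` and `search` are hypothetical. The point is that each worker keeps only k candidates, so the intermediate data stays tiny no matter how large the chunks are.

```python
import heapq

def similarity(a, b):
    """Toy stand-in for a molecular similarity score (higher = more similar)."""
    return -abs(a - b)

def top_k_in_chunk(chunk, query, k):
    """Stream through one chunk, keeping only its k best (score, item) pairs."""
    return heapq.nlargest(k, ((similarity(item, query), item) for item in chunk))

def merge_top_k(partials, k):
    """Merge the per-chunk candidate lists into the overall k best."""
    return heapq.nlargest(k, (pair for partial in partials for pair in partial))

def search(chunks, query, k):
    # Each chunk is independent, so in practice each call could run on its own core.
    partials = [top_k_in_chunk(chunk, query, k) for chunk in chunks]
    return [item for _, item in merge_top_k(partials, k)]

# Three small "chunks" standing in for the 500GB slices.
chunks = [range(0, 100), range(100, 200), range(200, 300)]
result = search(chunks, query=150, k=3)
```

With 16 chunks and k = 1000, the merge step only ever sees 16*1000 candidates, which is why the final step is cheap compared with the scan.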

dajohnson89 3457 days ago

I see your point -- size alone doesn't matter, it's how you use it :-).

But how do things change when the dataset grows to 9TB? Now we need more than one HD. Hadoop + Spark is built for this exact use case...

quantumhobbit 3457 days ago

I think the conundrum comes from the "donut hole" of medium-sized data. For 1TB use a script and a laptop; for 100TB use Spark running on dozens to hundreds of machines.

The problem is exactly that 8-9TB range, because running Spark on just two or three machines will be slower than on a laptop with an extra external drive. You need to scale up into potentially dozens of machines just to get the same performance you were getting on a laptop. You were OK with a laptop; add more data and now you have a not insignificant AWS bill, unless you are OK puttering around on a few machines much more slowly than on the laptop.

There is no middle-ground solution, so everyone starts with an overkill solution that scales, out of fear of getting stuck on one machine when the dataset grows. But most of these systems never grow enough to need to scale this way. So we are wasting resources running toy clusters on problems that would fit on a laptop.

Maybe I am becoming a cranky old man who yells at clouds, but I miss MPI. It has no frills, but it runs with next to no overhead and scales up to supercomputers with no donut hole in between.

user5994461 3457 days ago

If distributed software running on multiple expensive high-end servers cannot beat another solution running on a single laptop with a cheap external hard drive, the issue is not distributed systems; the issue is that that specific software is crap.

quantumhobbit 3457 days ago

There will always be some overhead, but yes, it seems like some of these frameworks are pretty bloated.

makapuf 3457 days ago

Processing the data wouldn't be the problem with a 2-socket Xeon, and neither would putting 3 or 5 HDDs in RAID 5. Getting the 32TB in, however, would take at least 7 hours at a saturated 10Gbps, if your disks can write that fast.
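
A back-of-the-envelope check of that transfer time, as a small sketch. It assumes decimal terabytes (10^12 bytes), a perfectly saturated link and no protocol overhead, so real transfers would take longer; the function name `transfer_hours` is just for illustration.

```python
def transfer_hours(size_bytes, link_bits_per_s):
    """Ideal time to push size_bytes through a link, ignoring all overhead."""
    return size_bytes * 8 / link_bits_per_s / 3600

TB = 10**12    # decimal terabyte (assumption)
TIB = 2**40    # binary tebibyte, for comparison

hours_decimal = transfer_hours(32 * TB, 10 * 10**9)
hours_binary = transfer_hours(32 * TIB, 10 * 10**9)

print(f"32 TB  over 10Gbps: {hours_decimal:.1f} hours")
print(f"32 TiB over 10Gbps: {hours_binary:.1f} hours")
```

Either way, the ingest alone eats most of a working day before any processing starts.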

mtanski 3457 days ago

10GigE seems fast, but in reality it's only 1.25GB/s in the ideal case. One enterprise PCIe SSD will saturate that, as will 5 of the old-style 3.5-inch 7.2k RPM drives (you can fit 12 of these in a dense 1U case).

That's why you see 40GigE or 56GigE used in HPC.
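
To make the arithmetic concrete, here is a rough sketch that converts link speeds to ideal throughput and counts how many spinning drives it takes to fill the pipe. The 250MB/s per-drive figure is an assumption, chosen because it is what "5 drives saturate 1.25GB/s" implies; real sequential rates vary by drive and workload.

```python
import math

DRIVE_MB_PER_S = 250  # assumed sequential rate of one 7.2k RPM drive

def link_gb_per_s(gbit_per_s):
    """Ideal payload rate of a link in GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8

def drives_to_saturate(gbit_per_s, drive_mb_per_s=DRIVE_MB_PER_S):
    """Number of drives whose combined sequential rate fills the link."""
    return math.ceil(link_gb_per_s(gbit_per_s) * 1000 / drive_mb_per_s)

for speed in (10, 40, 56):
    print(f"{speed}GigE: {link_gb_per_s(speed):.2f} GB/s, "
          f"{drives_to_saturate(speed)} drives to saturate")
```

Under that assumption, a single dense 1U box with 12 drives already outruns a 10GigE link by more than a factor of two.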

olavgg 3457 days ago

It's 25G, 50G or 100G today.
