| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by arjie 3566 days ago

> Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup.

You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. You now need a bunch of experts because any machine failure is a critical data loss situation, and a throughput cliff situation. Took a fairly low-stakes situation (disk failure, let's say) and transformed it into the Mother of All Failures.

And then you've decided that next year your working set won't be over the max for the particular machine type you've decided on. What will you even do then? Sell this and get the new bigger machine? Hook them both up and run things in a distributed fashion?

And then there's the logistics of it all. You're going to use tooling to submit jobs on this machine, and there's got to be configurable process limits, job queues, easy scheduling and rescheduling, etc. etc.

I mean, I'm firmly in the camp that lots of problems are better solved on the giant 64 TB, 256 core machines, but you're selling an idea that has a lot of drawbacks.

3 comments

qwertyuiop924 3566 days ago

And people with 64TB, 256 core machines don't have RAID arrays attached to their machine for this exact reason?

If it's "machines" plural, than you can do replication between the two. There's your fallover in case of complete failure.

link

yid 3565 days ago

> If it's "machines" plural, than you can do replication between the two.

This is the start of a scaling path that winds down Distributed Systems Avenue, and eventually leads to a place called Hadoop.

(Replication and consensus are remarkably difficult problems that Hadoop solves).

link

qwertyuiop924 3565 days ago

Fair 'nuff, but if you don't distribute compute, and you store the dataset locally on all the systems (not necessarily the results of mutations, just the datasets and the work that's being done), you'll still possibly reap massive perf gains over Hadoop in certain contexts.

link

yid 3564 days ago

> you'll still possibly reap massive perf gains over Hadoop in certain contexts.

Certainly, and unfortunately, the exact point at which Hadoop becomes the better option over big iron is generally an ongoing debate and shifting target. But there's no doubt that such a point actually exists.

link

qwertyuiop924 3564 days ago

...I'm not sure if it does. Bain's Law still stands.

But if it does, then it's a pretty big chunk of data, and a very fast network.

link

yid 3564 days ago

Couldn't find anything for Bain's Law, do you have a reference I could follow?

link

faragon 3566 days ago

Both disk and CPU failures are recoverable on expensive hardware.

link

nickpsecurity 3566 days ago

"You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. "

There's usually a combination of apps that work within the memory of the systems plus huge amount of external storage with a clustered filesystem, RAID, etc. Example supercomputer from SGI below since you brought them up that illustrates how they separate compute, storage, management and so on. Management software is available for most clusters to automate or make easy a lot of what you described in later paragraph. They use one. It was mostly a solved problem over a decade ago with sometimes one or two people running supercomputer centers at various universities.

http://www.nas.nasa.gov/hecc/resources/pleiades.html

link

nl 3566 days ago

Yes, but old-school MPI style supercomputer clusters are closer to Hadoop style clusters than standalone machines for the purpose of this discussion.

Both have mechanisms for doing distributed processing on data that is too big for a single machine.

The original argument was that command line tools etc are sufficient. In both these cases they aren't.

link