| > Building one or two very beefy machines is 1000x faster, and a lot cheaper than a Hadoop setup. You have a few petabytes of data and your working set is 50 TB. You put it on two machines. All your data is now on these SGI UV 3000s or whatever. You now need a bunch of experts because any machine failure is a critical data loss situation, and a throughput cliff situation. Took a fairly low-stakes situation (disk failure, let's say) and transformed it into the Mother of All Failures. And then you've decided that next year your working set won't be over the max for the particular machine type you've decided on. What will you even do then? Sell this and get the new bigger machine? Hook them both up and run things in a distributed fashion? And then there's the logistics of it all. You're going to use tooling to submit jobs on this machine, and there's got to be configurable process limits, job queues, easy scheduling and rescheduling, etc. etc. I mean, I'm firmly in the camp that lots of problems are better solved on the giant 64 TB, 256 core machines, but you're selling an idea that has a lot of drawbacks. |
If it's "machines" plural, than you can do replication between the two. There's your fallover in case of complete failure.