| Depends on your definition of 'larger' -- if this data is on S3 currently, I can't imagine we're talking multi-TB working sets here? Generally speaking, HDFS is going to be a clusterfuck to support unless you give a load of cash to Cloudera (actually, it will be regardless, but slightly better with the bill) -- even then you'll get the typical DB vendor line of 'not running -some patchset ver-? Then upgrade', which is really risky on a large cluster that pretty much works as you want. Also, unless you've got a load of hardware you can dedicate to this environment, you're going to be spending a lot of money on IaaS bills and your performance is probably not going to be very good. (Yeah, sure, you can virtualize HDFS, but generally I pass local storage through to the VMs and only run demos on AWS etc.) There has been a push towards this kind of mental complexity, with folks convincing themselves they needed to solve their problems in this manner, and now a bit of an ebb backwards (at least in the general space) now that your average deployer has found out how hard it is to do this stuff even with good support. Massive data ingestion and huge batch jobs might be a solution to a given problem you have, but it's probably not the only one, and it's almost certainly going to be the most difficult and expensive. Personally, I'd avoid HDFS, Flume, hfs, ZooKeeper and all the rest of the nightmares until you're absolutely sure that you need them (and if you're not already, then you probably don't). Also: check out Manta from Joyent. :} |
That should be the de facto standard for TB scale. In fact, don't bother comparing other products if you're at TB scale; just use S3.