by cyberpunk 3478 days ago
Depends on your definition of 'larger' -- if this data is on S3 currently I can't imagine we're talking multi-TB working sets here?

Generally speaking, HDFS is going to be a clusterfuck to support unless you give a load of cash to Cloudera (actually, it will be regardless, but slightly better with the bill) -- even then you'll get the typical db vendor line of 'not running -some patchset ver-? Then upgrade', which is really risky on a large cluster that already pretty much works as you want.

Also, unless you've got a load of hardware you can dedicate to this environment, you're going to be spending a lot of money on IaaS bills and your performance is probably not going to be very good. (Yeah, sure, you can virtualize HDFS, but generally I pass local storage through to the VMs, and only run demos on AWS etc.)

There has been a push towards such mental complexity, with folks convincing themselves they needed to solve their problems in this manner, and now a bit of an ebb backwards (at least in the general space) as your average deployer has found out how hard it is to do this stuff even with good support. Massive data ingestion and huge batch jobs might be a solution to a given problem you have, but it's probably not the only one, and it's almost certainly going to be the most difficult and expensive.

Personally, I'd avoid hdfs, flume, hfs, zookeeper and all the rest of the nightmares until you're absolutely sure that you need them (and if you're not already, then you probably don't).

Also: check out Manta from Joyent. :}


S3 is ideal for a multi-TB working set.

That should be the de facto standard for TB scale. In fact, don't bother comparing other products if you're at TB scale, just use S3.

Really?

Say you're going to ETL or Map/Reduce over all that data many times. You're telling me that reading it all for processing over S3's REST API (which is the only method?) instead of, say, a local array of 15k SAS drives behind PCIe HBAs is ideal?

It's pretty expensive and inefficient to my eyes, what am I missing?

In what way would S3 be better than running this on your own gear if cost and perf are clearly not going to be better (which are really the big factors in this decision)?
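To put rough numbers on that question, here's a back-of-envelope sketch of how long repeated full scans take at a given sustained read rate. Every figure in it (working set size, throughputs, number of passes) is a made-up placeholder, not a measurement of S3 or of any real disk array, and `scan_hours` is just a hypothetical helper; plug in rates you've actually measured on your own setup.

```python
def scan_hours(working_set_tb: float, throughput_mb_s: float, passes: int = 1) -> float:
    """Hours needed to read the whole working set `passes` times
    at a sustained aggregate throughput (decimal units: 1 TB = 1,000,000 MB)."""
    total_mb = working_set_tb * 1_000_000 * passes
    return total_mb / throughput_mb_s / 3600


if __name__ == "__main__":
    # ASSUMPTIONS: placeholder values for illustration only, not benchmarks.
    working_set_tb = 5
    passes = 10
    assumed_rates_mb_s = {
        "remote object store (assumed)": 500,
        "local disk array (assumed)": 2000,
    }
    for label, rate in assumed_rates_mb_s.items():
        hours = scan_hours(working_set_tb, rate, passes)
        print(f"{label}: {hours:.1f} h for {passes} passes over {working_set_tb} TB")
```

This only covers read time; it says nothing about what either option costs, which is the other half of the argument.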

You're missing that S3 is the storage system for Redshift and EMR (EMR = managed Hadoop on AWS).

They are pretty cheap, efficient and simple to use ;)