Hacker News new | ask | show | jobs
by mastratton3 3481 days ago
We're actually having a debate now as we're starting to process larger datasets as to whether or not we should keep everything on S3 or start using HDFS w/ Hive. I'm curious if you guys considered HDFS and why you decided to go strictly with S3, and additionally, are there any issues you encounter with S3.
3 comments

We've considered HDFS, but we really liked the idea of having compute only clusters and have our data kept completely separate. Clusters failure happen and having data on S3 makes us worry less if a cluster goes down. Just spin up a new one and you're good to go.

There is a bit of more latency when using S3 compared to HDFS, but it's not bad and the benefits overcame that. We do have a couple of jobs that store some intermediate results in HDFS, but in the end everything lands in S3.

We encountered a few issues with S3 at the beginning mostly around the eventual consistency, but nothing that could not be fixed.

netflix i think said they see about a 10% perf hit using s3 instead of hdfs, using emr where they launch temporary clusters that do a job and shut down, and that performance cost was well worth the flexibility of being to launch independent clusters whenever they need.
We're also using S3 but we have a hybrid approach to the problem. The event data is immutable and you use instance stores with EC2 and cache the data to local SSDs and use S3 as backups. The thoughtput of HDFS is better than S3 or EFS but I would prefer to use EFS in this case since it also utilizes caching under the hood and cheaper alternative.
Oh great, thanks for the reply. I think thats about where I think we'll land... keep S3 as the primary source, but have HDFS be used for intermediate jobs.
Good luck and have fun! :D
Depends on your definition of 'larger' -- if this data is on S3 currently I can't imagine we're talking multi-TB working sets here?

Generally speaking, HDFS is going to be a clusterfuck to support unless you give a load of cash to cloudera (actually, it will be regardless but slightly better with the bill) -- even then you'll get the typical db vendor line of 'not running -some patchset ver-, then upgrade. Which is really risky on a large cluster which pretty much works as you want.

Also, unless you've got a load of hardware you can dedicate to this environment, then you're going to be spending a lot of money on IAAS bills and your performance is probably not going to be very good. (Yeah sure you can virtualize HDFS but generally I passthrough local storage to the VM's, and only run demo on AWS etc).

There was been a push towards such mental complexity and folks convincing themselves they needed to solve their problems in this manner, and now a bit of an ebb backwards (at least, in the general space) now that your avg deployer found out how hard it is to do this stuff even with good support. Massive data ingestion and huge batch jobs might be a solution to a given problem you have, but it's probably not the only one whereas it's almost certainly going to be the most difficult and expensive.

Personally, I'd avoid hdfs, flume, hfs, zookeeper and all the rest of the nightmares until you're absolutely sure that you need them (and if you're not already, then you probably don't).

Also: Check out manta from joyent. :}

S3 is ideal for multi TB working set.

That should be the de-factor standard for TB scale. In fact, don't bother comparing other products if you're TB scale, just use S3.

Really?

Say you're going to ETL or Map/Reduce over all that data a lot of times, you're telling me that reading it all for processing over S3's rest api (which is the only method?) instead of, say, a local array of 15k sas's over pcie hba's is ideal?

It's pretty expensive and inefficient to my eyes, what am I missing? I

In what way would S3 be better than running this on your own gear if cost and perf are clearly not going to be better (which are really the big factors in this decision)?

You're missing that S3 is the storage system for RedShift and EMR (emr = managed hadoop on AWS).

They are pretty cheap, efficient and simple to use ;)

I would recommend S3.

Using S3 with EMR in production was breeze for us. Even cost effective, since you can play with spot instances depending on your jobs. You also improve utilization of your resources.

With recent Athena it is possible also to do ad hoc queries directly :) Before it required starting "QA" cluster.