| HN Mirror

by dianamp 3483 days ago

We considered HDFS, but we really liked the idea of having compute-only clusters and keeping our data completely separate. Cluster failures happen, and having the data on S3 means we worry less if a cluster goes down. Just spin up a new one and you're good to go.

There is a bit more latency when using S3 compared to HDFS, but it's not bad, and the benefits outweigh it. We do have a couple of jobs that store some intermediate results in HDFS, but in the end everything lands in S3.

We encountered a few issues with S3 at the beginning, mostly around eventual consistency, but nothing that could not be fixed.
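
The thread does not say how those issues were fixed. One common workaround, sketched here purely as an assumption, is to poll the listing until every expected output key is visible before a downstream job reads it. The names `wait_until_visible` and `fake_listing` are hypothetical, and the listing function stands in for a real S3 list call:

```python
import time
from typing import Callable, Iterable, Set


def wait_until_visible(
    list_keys: Callable[[], Iterable[str]],
    expected: Set[str],
    attempts: int = 5,
    delay_s: float = 0.0,
) -> bool:
    """Poll a listing until every expected key shows up, or give up."""
    for attempt in range(attempts):
        missing = expected - set(list_keys())
        if not missing:
            return True
        # Back off a little longer after each failed check.
        time.sleep(delay_s * (2 ** attempt))
    return False


# Hypothetical stand-in for a listing that lags behind writes:
# the second key only appears on the third call.
calls = {"n": 0}


def fake_listing():
    calls["n"] += 1
    keys = ["out/part-0000"]
    if calls["n"] >= 3:
        keys.append("out/part-0001")
    return keys


ok = wait_until_visible(fake_listing, {"out/part-0000", "out/part-0001"})
```

In a real job, `delay_s` would be non-zero and `list_keys` would wrap whatever client lists the output prefix.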

3 comments

idunno246 3483 days ago

Netflix, I think, said they see about a 10% perf hit using S3 instead of HDFS. They use EMR, where they launch temporary clusters that do a job and shut down, and that performance cost was well worth the flexibility of being able to launch independent clusters whenever they need.

buremba 3483 days ago

We're also using S3, but we take a hybrid approach to the problem. The event data is immutable, and we use EC2 instance stores, cache the data on local SSDs, and use S3 as the backup. The throughput of HDFS is better than S3 or EFS, but I would prefer to use EFS in this case since it also does caching under the hood and is a cheaper alternative.

mastratton3 3483 days ago

Oh great, thanks for the reply. I think that's about where we'll land... keep S3 as the primary source, but use HDFS for intermediate jobs.
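
A rough sketch of that split, assuming a job picks its output location by whether the data is intermediate or final. The bucket, paths, job names, and the `StorageLayout` class are all hypothetical placeholders, not anything described in the thread:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLayout:
    # Both roots are hypothetical placeholders.
    durable_root: str = "s3://example-bucket/warehouse"  # primary source of truth
    scratch_root: str = "hdfs:///tmp/scratch"            # lost if the cluster dies

    def path_for(self, job: str, dataset: str, intermediate: bool = False) -> str:
        """Return the output location for a dataset produced by a job."""
        root = self.scratch_root if intermediate else self.durable_root
        return f"{root}/{job}/{dataset}"


layout = StorageLayout()
stage_path = layout.path_for("sessionize", "joined", intermediate=True)
final_path = layout.path_for("sessionize", "daily_sessions")
```

Only `final_path` needs to survive the cluster. Anything under the scratch root can be recomputed from S3.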

dianamp 3483 days ago

Good luck and have fun! :D
