by dianamp
3483 days ago
We considered HDFS, but we really liked the idea of having compute-only clusters and keeping our data completely separate. Cluster failures happen, and having the data on S3 makes us worry less if a cluster goes down: just spin up a new one and you're good to go. There is a bit more latency with S3 than with HDFS, but it's not bad and the benefits outweigh it. We do have a couple of jobs that store some intermediate results in HDFS, but in the end everything lands in S3. We ran into a few issues with S3 at the beginning, mostly around eventual consistency, but nothing that could not be fixed.
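To make the eventual consistency point concrete, here is a minimal sketch of one common workaround: poll for a freshly written key with exponential backoff before a downstream job reads it. This is an illustration only, not necessarily what we ran. The helper name `wait_until_visible` and the `exists` callable are hypothetical; in practice `exists` could wrap something like a boto3 `head_object` call.

```python
import time
from typing import Callable


def wait_until_visible(
    exists: Callable[[str], bool],   # hypothetical check, e.g. a wrapper around an S3 HEAD request
    key: str,
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until exists(key) returns True, backing off exponentially between tries."""
    for attempt in range(attempts):
        if exists(key):
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

A job would call this right after writing its output and only kick off the next stage once it returns True. Passing `sleep` in as a parameter keeps the helper easy to test without real delays.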