by mastratton3
3481 days ago
We're actually having a debate now, as we start to process larger datasets, about whether we should keep everything on S3 or start using HDFS w/ Hive. I'm curious whether you guys considered HDFS and why you decided to go strictly with S3. Additionally, are there any issues you've encountered with S3?
There is a bit more latency when using S3 compared to HDFS, but it's not bad, and the benefits outweighed that. We do have a couple of jobs that store some intermediate results in HDFS, but in the end everything lands in S3.
We encountered a few issues with S3 at the beginning, mostly around eventual consistency, but nothing that could not be fixed.
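
For illustration only (the thread doesn't say how those issues were actually fixed): one common way to cope with eventual consistency is to poll until a freshly written object becomes visible before the next job reads it. A minimal sketch, assuming a hypothetical `exists` callable that wraps whatever S3 client is in use; the function name, parameters, and defaults here are made up for the example.

```python
import time


def wait_until_visible(exists, key, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Poll exists(key) with exponential backoff.

    `exists` is a hypothetical callable (e.g. a thin wrapper around an
    S3 HEAD request) that returns True once the object can be read.
    Returns True as soon as the key is visible, False if it never shows
    up within `attempts` checks.
    """
    for attempt in range(attempts):
        if exists(key):
            return True
        # Back off before the next check, but don't sleep after the last one.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

A downstream job would call something like `wait_until_visible(my_exists, "some/key")` before reading, and fail loudly if it returns False rather than silently processing partial data.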