If you use EMR or just roll your own Hadoop in EC2, then:
1. Hadoop runs on EC2
2. Data is stored on S3
3. Intermediate results are stored on EC2
4. Hadoop loads the data from S3 to EC2
5. EC2<->S3 bandwidth is not that fast or efficient (S3 proxy, network contention, TCP/IP processing)

Hypothetical MapReduce/ZeroVM/Swift scenario:
1. Data is stored on S3/Swift
2. Map and Reduce functions run inside S3/Swift, secured by ZeroVM, and in the majority of cases access data locally without networking/proxies getting in the way.
3. Intermediate and final results are also stored within S3/Swift.
4. Local data access is efficient, fast and predictable
5. Local networking within S3/Swift is more efficient, fast and predictable than S3<->EC2 / Swift<->Nova

Accelerated Hadoop scenario: Exactly as in #1, except that Hadoop pushes its predicates down ("predicate pushdown optimization") into S3/Swift, where they run secured by ZeroVM.

Regarding 'due to security restrictions': I meant that a cloud vendor would not let you run your own code in S3 or CloudFiles. Why? Because you could mess up other people's data and the storage system itself. Why not run it in a VM inside S3? Well, I guess it would be impractical due to the long provisioning time of a conventional VM.
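
To make the difference between pulling data into EC2 and pushing the filter into the storage layer a bit more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the "object store" is a plain dict, the network is just a byte counter, and the helper names (`make_store`, `pull_then_filter`, `filter_in_store`) are hypothetical. It does not use any real S3, Swift, Hadoop or ZeroVM API.

```python
# Toy model only: the object store is a dict and "network traffic" is a
# byte counter. Nothing here is a real S3, Swift, Hadoop or ZeroVM API.

def make_store(n_records=1000):
    """Build a fake object store; every 10th record is marked as an error."""
    return {
        f"obj{i}": f"record {i} status={'error' if i % 10 == 0 else 'ok'}"
        for i in range(n_records)
    }

def pull_then_filter(store, predicate):
    """Scenario #1: copy every object to the compute node, then filter there."""
    bytes_moved = 0
    matches = []
    for value in store.values():
        bytes_moved += len(value.encode())  # every object crosses the network
        if predicate(value):
            matches.append(value)
    return matches, bytes_moved

def filter_in_store(store, predicate):
    """Pushdown: evaluate the predicate next to the data, ship only matches."""
    matches = [value for value in store.values() if predicate(value)]
    bytes_moved = sum(len(value.encode()) for value in matches)
    return matches, bytes_moved

def is_error(record):
    """Example predicate: keep only the records marked as errors."""
    return record.endswith("status=error")

store = make_store()
pulled, pulled_bytes = pull_then_filter(store, is_error)
pushed, pushed_bytes = filter_in_store(store, is_error)
print(len(pulled), pulled_bytes, pushed_bytes)
```

Both functions return the same matching records; the only thing that changes is how many bytes have to leave the storage layer. The sketch says nothing about the sandboxing itself, which is the part ZeroVM would have to provide.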