by camuel 5199 days ago
If you use EMR or just roll your own Hadoop on EC2, then:

1. Hadoop runs on EC2
2. Data is stored on S3
3. Intermediate results are stored in EC2
4. Hadoop loads the data from S3 to EC2
5. EC2<->S3 bandwidth is not that fast or efficient (S3 proxy, network contention, TCP/IP processing)

Hypothetical MapReduce/ZeroVM/Swift scenario:

1. Data is stored on S3/Swift
2. Map and Reduce functions run inside S3/Swift, secured by ZeroVM, in the majority of cases accessing data locally without networking/proxies getting in the way
3. Intermediate and final results are also stored within S3/Swift
4. Local data access is efficient, fast and predictable
5. Local networking within S3/Swift is more efficient, fast and predictable than S3<->EC2 / Swift<->Nova

Accelerated Hadoop scenario:

Exactly as in the first scenario, except that Hadoop applies a "predicate pushdown" optimization, pushing its filters into S3/Swift, where they run secured by ZeroVM.
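To make the pushdown idea concrete, here is a minimal sketch in Python. It is only an illustration: `STORE`, `fetch_then_filter`, `pushdown_filter` and `is_error` are hypothetical names, the "object store" is an in-memory dict, and nothing here is a real S3, Swift or ZeroVM API. It just counts how many records would have to leave the storage tier in each case.

```python
# Hypothetical in-memory stand-in for an object store: key -> list of records.
STORE = {
    "logs/part-0": [{"status": 200, "bytes": 512}, {"status": 500, "bytes": 128}],
    "logs/part-1": [{"status": 500, "bytes": 256}, {"status": 200, "bytes": 1024}],
}


def fetch_then_filter(store, predicate):
    """Baseline: every record leaves the store, then the compute side filters."""
    transferred = [rec for records in store.values() for rec in records]
    matches = [rec for rec in transferred if predicate(rec)]
    return matches, len(transferred)


def pushdown_filter(store, predicate):
    """Pushdown: the predicate runs next to the data; only matches leave the store."""
    transferred = [
        rec for records in store.values() for rec in records if predicate(rec)
    ]
    return transferred, len(transferred)


def is_error(rec):
    """Example predicate (an assumption): keep only server-error records."""
    return rec["status"] >= 500


baseline_matches, baseline_moved = fetch_then_filter(STORE, is_error)
pushed_matches, pushed_moved = pushdown_filter(STORE, is_error)

print("records moved without pushdown:", baseline_moved)
print("records moved with pushdown:   ", pushed_moved)
```

Both paths return the same matching records; the only difference is how much data crosses the storage/compute boundary, which is the part that is slow between S3 and EC2.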

Regarding 'due to security restrictions': I meant that a cloud vendor would not let you run your own code in S3 or CloudFiles. Why? Because you could mess up other people's data and the storage system itself. Why not run in a VM inside S3? Well, I guess it would be impractical due to the long provisioning time of a conventional VM.

1 comments

That criticism is specific to S3, not EC2 or Hadoop. It's perfectly feasible and probably preferable to have Hadoop work on local files in instance store volumes (or EBS if you're mad).
There is another issue with running Hadoop on EC2 (without S3). Instance storage is relatively small: about 3.6 TB on the largest instance and 1.5 TB on the other "large" instances. In a typical Hadoop machine I would expect about 8 TB. So local storage is prohibitively expensive for big data tasks. At the same time, if we use local storage we lose elasticity: we have to run the cluster all the time, even when there are no jobs to run. That kills the main point of using Hadoop in the cloud, which is to pay for computational resources on demand.
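A rough back-of-the-envelope sketch of that storage argument, in Python. The per-node capacities (3.6 TB and 8 TB) are the figures quoted above; the 100 TB dataset size is purely an assumed example, and the replication factor of 3 is the usual HDFS default. The helper name `nodes_needed` is hypothetical.

```python
import math

DATASET_TB = 100          # assumed example dataset size, not from the thread
REPLICATION = 3           # usual HDFS default replication factor
EC2_LARGEST_TB = 3.6      # instance storage on the largest instance (quoted above)
TYPICAL_HADOOP_TB = 8     # storage expected on a typical Hadoop machine (quoted above)


def nodes_needed(dataset_tb, replication, per_node_tb):
    """Minimum node count needed just to hold the replicated data."""
    return math.ceil(dataset_tb * replication / per_node_tb)


ec2_nodes = nodes_needed(DATASET_TB, REPLICATION, EC2_LARGEST_TB)
hadoop_nodes = nodes_needed(DATASET_TB, REPLICATION, TYPICAL_HADOOP_TB)

print("EC2 largest instances needed:   ", ec2_nodes)
print("typical Hadoop machines needed: ", hadoop_nodes)
```

Under these assumptions you need more than twice as many of the largest EC2 instances as typical Hadoop machines purely for capacity, before any compute requirement is considered.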
But instance store is transient! You may argue that if you do triple replication across different availability zones then you are OK. Well, in that case it would be very costly, as you will end up with a constantly spinning EC2 cluster. Even if you don't do any computation, you must keep it all spinning. And see what happened to elasticity... you end up paying inflated cloud prices for a constantly spinning, fixed-size EC2 cluster, instead of being able to rapidly roll out a large cluster, run the computation, fold it back, and pay only for what you have used. Isn't that the true promise of the cloud?
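The elasticity point can be sketched with some simple arithmetic in Python. Every number here is a made-up assumption chosen for illustration (cluster size, hourly price, busy hours per month), not a real AWS price, and the on-demand figure deliberately ignores object-storage fees and the time spent loading data.

```python
NODES = 50                    # assumed cluster size
PRICE_PER_NODE_HOUR = 1       # hypothetical price in dollars, not a real AWS rate
HOURS_PER_MONTH = 720         # 30 days * 24 hours
JOB_HOURS_PER_MONTH = 40      # assumed hours the cluster actually computes


def monthly_cost(nodes, price_per_node_hour, hours):
    """Compute cost of running `nodes` instances for `hours` hours."""
    return nodes * price_per_node_hour * hours


# Data on instance store: the cluster must stay up all month to keep the data.
always_on = monthly_cost(NODES, PRICE_PER_NODE_HOUR, HOURS_PER_MONTH)

# Data in an object store: the cluster exists only while jobs are running.
on_demand = monthly_cost(NODES, PRICE_PER_NODE_HOUR, JOB_HOURS_PER_MONTH)

print("always-on cluster: $", always_on)
print("on-demand cluster: $", on_demand)
print("ratio:", always_on // on_demand)
```

With these assumed figures the always-on cluster costs 18 times as much in compute; the real ratio depends entirely on how idle the cluster is.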