by navi54 3798 days ago
I am interested in this, mostly as a computational biologist. Any intro on Hadoop? What is it used for?
1 comments

Hadoop itself consists of several parts. HDFS is a distributed filesystem that can span lots of machines; it splits files into large blocks and stores copies of each block on several machines.
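To make the "one file spread over many machines" idea concrete, here is a toy sketch in Python. It is not how HDFS is implemented: the block size, the replica count, the node names and the round-robin placement are all made-up assumptions, chosen only to show the principle of cutting data into pieces and keeping each piece on more than one machine.

```python
def split_into_blocks(data, block_size):
    """Cut a bytes object into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas):
    """Assign every block to `replicas` distinct nodes.

    Toy policy (an assumption, not the real HDFS placement policy):
    block b goes to nodes b, b+1, ... taken round-robin from the node list.
    """
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replicas)]
        for b in range(num_blocks)
    }


if __name__ == "__main__":
    data = b"ACGT" * 25                      # 100 bytes of fake sequence data
    blocks = split_into_blocks(data, 32)     # hypothetical tiny block size
    placement = place_blocks(len(blocks), ["node1", "node2", "node3"], 2)
    for block_id, holders in placement.items():
        print(block_id, len(blocks[block_id]), holders)
```

Losing a single node in this sketch never loses a block, because every block also lives on another node. That is the property a replicated filesystem gives you without you having to manage it yourself.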

MapReduce is Google's approach (the paper was published in 2004) to running calculations over lots of data. There is now also YARN, which can be described as a general job scheduler and resource manager.

At the moment a lot of people run software like Spark on YARN (it does more in memory, is often faster, and with extra setup can use the GPUs on the cluster machines).

So if you have biological data, you could load it into HDFS and have Spark or MapReduce jobs process it. The point of Hadoop is that you don't need to worry about getting the clustering and distributed setup right; Hadoop does that for you. You program much as you would write a single-threaded algorithm (at least in the simple cases).

E.g. this Python code counts words over as much data as you want in parallel: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapr...
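For a feel of what such a job looks like, here is a minimal sketch of the word count idea in plain Python. It is not the code from the linked tutorial and it does not use Hadoop at all: it simulates the map, shuffle and reduce steps in a single process, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this illustration. On a real cluster the framework runs the map and reduce steps on many machines and does the shuffle for you.

```python
import re
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in re.findall(r"[a-z0-9']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In Hadoop the framework does this between the map and reduce phases.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce locally over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the cat", "The dog and the cat"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Note that `map_words` only ever sees one line and `reduce_counts` only ever sees one word. Neither needs to know about the rest of the data, and that is what lets a framework spread the work over many machines.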

If you google each of these projects you'll find a lot of information.

Here are the original papers that should give a good idea:

- HDFS (the Google File System was the original idea; Hadoop is a free implementation, developed largely at Yahoo) - http://static.googleusercontent.com/media/research.google.co...

- MapReduce - http://static.googleusercontent.com/media/research.google.co...

- Spark: http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark...

If you want a good introduction, read Hadoop: The Definitive Guide, 3rd Edition.

We used it to build a search engine out of multiple terabytes of crawl data, something that does not fit well on a single computer.

You can do all kinds of computations, but graph problems and other things that are hard to parallelize often need more thought or other solutions beyond MapReduce. MapReduce only fits a certain class of problems: it's great for counting and aggregating stuff, but beyond that it's often not a good fit.