by stingraycharles 5286 days ago
If you stick with Java, or Java plus a few extensions, you can probably write your own TaskSplitter and define how Hadoop should divide your jobs into smaller tasks. Be aware: you might either have a lot of trouble finding the 'optimal splits', or lose one of Hadoop's major advantages, data (calculation) locality. For example, if you combine 10 smaller files into a single task and they sit on 10 different DataNodes, chances are small that all of the files are stored on the machine that's performing the MapReduce task.
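
To put a rough number on the locality point, here is a minimal sketch in plain Java. It does not use any Hadoop API; the class name SplitLocality, the host names and the round-robin placement are all made up for illustration. Given the hosts that hold each file in a combined split, it counts how many of those files the best-placed host could read locally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Toy model only: no Hadoop classes involved, all names are hypothetical.
final class SplitLocality {

    /** Largest number of files in the split that any single host stores locally. */
    static int bestLocalCount(List<List<String>> fileHosts) {
        Map<String, Integer> filesPerHost = new HashMap<>();
        int best = 0;
        for (List<String> hosts : fileHosts) {
            // HashSet so a host listed twice for the same file is counted once.
            for (String host : new HashSet<>(hosts)) {
                int count = filesPerHost.merge(host, 1, Integer::sum);
                best = Math.max(best, count);
            }
        }
        return best;
    }

    /** Hypothetical placement: file f has replicas on nodes f, f+1, ... (mod nodes). */
    static List<List<String>> roundRobin(int files, int nodes, int replication) {
        List<List<String>> placement = new ArrayList<>();
        for (int f = 0; f < files; f++) {
            List<String> hosts = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                hosts.add("dn" + ((f + r) % nodes));
            }
            placement.add(hosts);
        }
        return placement;
    }

    public static void main(String[] args) {
        // 10 small files combined into one task, spread over 10 DataNodes.
        System.out.println(bestLocalCount(roundRobin(10, 10, 1)) + " of 10 files local");
        System.out.println(bestLocalCount(roundRobin(10, 10, 3)) + " of 10 files local");
    }
}
```

Under this assumed placement, even the best host has only 1 of the 10 files locally with a single replica, and 3 of 10 with three replicas, so most of the combined task's input has to come over the network.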

One thing to note, though: HDFS is indeed very stream oriented. It works in blocks of 64 MB (by default), and only sends data upstream when you either close a file or a full block is available to be written. So if your server crashes at 63 MB and the data can't be recovered from anywhere else, you'll have lost all 63 MB. That was one of the big caveats we had to work around in our own use of Hadoop.

1 comments

This isn't quite true: data is streamed from the client through a pipeline made up of all of the replicas as it's written. It's true you'll lose data if you crash in the middle of a block, _unless_ you call the sync() function, which makes sure the data has been fully replicated to all of the nodes.
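
As a rough illustration of the arithmetic being discussed, here is a toy model in plain Java. It is not the HDFS client and calls no Hadoop API; the class BlockWriterModel and its methods are hypothetical. It assumes only what the thread says: bytes in completed blocks are safe, bytes in an unfinished block are at risk if the client crashes, and a sync() makes everything written so far safe.

```java
// Toy model only: hypothetical class, not part of Hadoop.
final class BlockWriterModel {
    private final long blockSize;
    private long total;    // bytes the client has written so far
    private long durable;  // bytes that would survive a client crash

    BlockWriterModel(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    /** Write n bytes; every completed block counts as durable. */
    void write(long n) {
        total += n;
        long completedBlocks = (total / blockSize) * blockSize;
        durable = Math.max(durable, completedBlocks);
    }

    /** Models the sync() call mentioned above: everything written so far is safe. */
    void sync() {
        durable = total;
    }

    /** Bytes that would be lost if the client crashed right now. */
    long lostOnCrash() {
        return total - durable;
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;
        BlockWriterModel w = new BlockWriterModel(64 * mb);
        w.write(63 * mb);
        System.out.println("lost without sync: " + w.lostOnCrash() / mb + " MB");
        w.sync();
        System.out.println("lost after sync: " + w.lostOnCrash() / mb + " MB");
    }
}
```

In this model a crash at 63 MB loses all 63 MB, while the same crash right after a sync() loses nothing. How often to sync is a trade-off the model doesn't capture, since each call has a cost on a real cluster.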
Hadoop only writes a block from a client to a DataNode when a whole block is available. This is to minimize the number of open connections on the DataNodes: it can take a long time for the client to generate 64 MB of data, while distributing the block over the replicas takes relatively little time.

For more information about this, see: http://hadoop.apache.org/common/docs/current/hdfs_design.htm... and http://hadoop.apache.org/common/docs/current/hdfs_design.htm...