by stingraycharles
5286 days ago
If you're willing to write Java, or a few Java extensions, you can probably write your own TaskSplitter and define how Hadoop should divide your jobs into smaller tasks. Be aware, though: you may either have a lot of trouble finding the 'optimal splits', or lose one of Hadoop's major advantages, data (calculation) locality. For example, if you combine 10 smaller files into a single task and you have 10 different DataNodes, chances are small that all the files are stored on the machine performing the MapReduce task.

One thing to note, though: HDFS is indeed very stream oriented. It works in blocks of 64 MB (by default), and only sends data upstream when you either close the file or a full block is available to be written. So if your server crashes at 63 MB and the data is unrecoverable, you'll have lost all 63 MB of it. That was one of the big caveats we had to work around for the problems we solve with Hadoop.
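
To put rough numbers on both points, here is a back-of-the-envelope sketch in plain Java (no Hadoop dependency). The class and method names are made up for illustration. It assumes each small file fits in one block, a replication factor of 3, and that replicas are placed uniformly and independently across DataNodes, which is a simplification of how HDFS really places them. The buffering part assumes the behaviour described above: only full blocks, or a closed file, reach the cluster.

```java
// Hypothetical helper for illustration only; not part of any Hadoop API.
class HdfsSketch {
    static final long MB = 1024L * 1024L;

    // Bytes of an open file that would be lost in a crash, assuming only
    // completed blocks have been sent upstream.
    static long atRiskBytes(long bytesWritten, long blockSize) {
        return bytesWritten % blockSize;
    }

    // Chance that one given node holds a replica of every file, assuming
    // one block per file and independent, uniform replica placement.
    static double allLocalProbability(int files, int dataNodes, int replication) {
        double perFile = Math.min(1.0, (double) replication / dataNodes);
        return Math.pow(perFile, files);
    }

    public static void main(String[] args) {
        // Crash after writing 63 MB with 64 MB blocks: nothing was sent yet.
        System.out.println(atRiskBytes(63 * MB, 64 * MB) / MB + " MB at risk");

        // 10 small files, 10 DataNodes, assumed replication factor of 3.
        System.out.printf("P(all 10 files local) = %.2e%n",
                allLocalProbability(10, 10, 3));
    }
}
```

Under those assumptions the chance of full locality for a 10-file combined task is 0.3^10, a few in a million, so most of the input would be read over the network.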