HN Mirror

by akshayubhat 5807 days ago

I could not understand several points made in this article:

1: He mentions Infochimps, but as far as I know it is more of an eBay for datasets than a provider of, or support vendor for, the Big Data stack. Also, is it successful? I am unsure how Infochimps is related to the Big Data stack.

2: From what I remember reading about 80legs, it uses distributed grid computing to run the crawlers (something like SETI@home). I doubt Hadoop was ever designed for such applications, so this surely isn't a Hadoop use case.

3: Quoting:

       While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short against our stack when it comes to the cost of collecting data.  We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they’re able to return the minimum amount of data in their result sets.  The processing (or reduction, pardon the pun) is done on the nodes.  Actual result sets are very small relative to the size of the input set.

Again, I am unsure how this is different from Hadoop. Hadoop uses the same principle, "move computation closer to the data", so a crawler implemented on Hadoop (something Hadoop is not intended for) would also store data locally rather than on some other node.
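To make the "move computation closer to the data" point concrete, here is a minimal Python sketch. It is not Hadoop or 80legs code: the node names, the page strings, and the `filter_on_node` and `collect` helpers are all made up for illustration, and the "nodes" are just entries in a dictionary.

```python
from typing import Callable, Dict, List

# Hypothetical data: each "node" holds the pages it stored locally.
NODE_PAGES: Dict[str, List[str]] = {
    "node-a": ["hadoop cluster news", "cat pictures"],
    "node-b": ["crawler design notes", "hadoop at yahoo"],
}


def filter_on_node(pages: List[str], keep: Callable[[str], bool]) -> List[str]:
    """Stands in for work done where the data lives: only matches leave the node."""
    return [page for page in pages if keep(page)]


def collect(nodes: Dict[str, List[str]], keep: Callable[[str], bool]) -> List[str]:
    """Coordinator side: gathers the already-filtered (small) result sets."""
    results: List[str] = []
    for name in sorted(nodes):  # fixed order so the output is deterministic
        results.extend(filter_on_node(nodes[name], keep))
    return results


matches = collect(NODE_PAGES, lambda page: "hadoop" in page)
print(matches)  # ['hadoop cluster news', 'hadoop at yahoo']
```

Either way, the filter runs next to the stored pages and only the reduced result set travels over the network, which is why the quoted description does not by itself distinguish the two approaches.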

He also mentions "We have about 50,000 computers using their excess bandwidth." 50,000? The biggest Hadoop cluster that I know of (Yahoo) has ~10-20k nodes, and Hadoop was never meant to be used at 50K scale for crawling. So they had no option other than building their own system, and that would still be true if they had to build it today.

4: Quoting:

      One advantage is optimization — an “off-the-shelf” system is going to have some generalities built into it that can’t be optimized to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.

The only issue I can think of regarding Hadoop is that it's written in Java; otherwise it's an extremely extensible piece of software. Unless you are designing a real-time messaging system or a distributed system for high-frequency trading, Hadoop is good enough for most applications. Also, what about the cost of finding programmers good enough to build such a system? Another advantage of Hadoop is that under low load the idle nodes can be used for something else, such as processing other data; with your own solution that would be harder. Your IP and your secret sauce also aren't of much use if you don't have solid patents for them; without patents, they mostly end up becoming a maintenance nightmare after the original engineers cash out. And what if the big company already has a Hadoop cluster? It would be even more difficult for them to integrate with your computing power.

While I agree with the author's conclusion that a highly focused startup should build its own proprietary solution, I cannot agree with the evidence behind that argument. A grid-based crawler with 50K machines isn't something that Hadoop was ever designed to support.

1 comment

jdrock 5807 days ago

A major point made in the article is that the standard big data stack does not fit all big data needs.

There's a growing assumption that this stack is sort of all that's needed, and that's just not the case.
