

by haberman 5220 days ago

> There is no tradeoff, just advantages.

Though I don't have deep expertise in Hadoop, I find this claim highly suspect. High-level APIs achieve user-friendliness by making decisions/assumptions about the way a lower-level API will be used. I would be very surprised if there was no use case for which your API does impose a trade-off vs. the low-level Hadoop API.

I feel much more confident using a high-level API if its author is up-front about what assumptions it's making. If the claim is that there is no trade-off vs. the low-level API, I generally conclude that the author doesn't understand the problem space well enough to know what those trade-offs are.

I could be wrong, but this is my bias/experience.


ferrerabertran 5220 days ago

Hi haberman, I'm one of the developers of Pangool. Let me try to clarify why we stated that. I understand it may sound aggressive.

Pangool is based on an extension of the MapReduce model we suggest and call "Tuple MapReduce". This is explained in detail in this post: http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...

What this means is that in Pangool, if you worked with 2-sized Tuples, you would be able to do exactly what you do now with Java MapReduce. That includes custom RawComparators and arbitrary business logic anywhere in the MapReduce chain (Mapper, Combiner, Reducer). Using n-sized Tuples together with Pangool's group-by, sort-by and reduce-side join APIs only means less code and simpler code, at no loss of performance or flexibility.
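To make the 2-sized versus n-sized Tuple point concrete, here is a toy in-memory sketch in Python. It is not Pangool's API: `tuple_mapreduce`, `group_width` and `sort_width` are hypothetical names made up for illustration, and the sketch assumes that the group-by fields are a leading prefix of the sort-by fields. With 2-sized Tuples grouped on the first field it behaves like plain key/value MapReduce; with 3-sized Tuples it gives a sorted view inside each group without a hand-written comparator.

```python
from itertools import groupby


def tuple_mapreduce(records, map_fn, reduce_fn, group_width, sort_width=None):
    """Toy, single-process model of a tuple-based MapReduce round.

    Hypothetical helper, not a real library call. Tuples are sorted on their
    first `sort_width` fields and grouped on their first `group_width` fields.
    """
    if sort_width is None:
        sort_width = group_width
    if sort_width < group_width:
        raise ValueError("sort fields must include the group fields as a prefix")

    # Map phase: every input record may emit any number of tuples.
    tuples = [t for record in records for t in map_fn(record)]

    # Shuffle phase: one sort, then grouping on the leading fields.
    tuples.sort(key=lambda t: t[:sort_width])

    # Reduce phase: each group arrives already ordered by the sort fields.
    output = []
    for group_key, group in groupby(tuples, key=lambda t: t[:group_width]):
        output.extend(reduce_fn(group_key, list(group)))
    return output


# Case 1: 2-sized tuples, grouped on field 0. This is ordinary key/value word count.
def wc_map(line):
    for word in line.split():
        yield (word, 1)


def wc_reduce(key, tuples):
    yield (key[0], sum(count for _, count in tuples))


word_counts = tuple_mapreduce(["a b a", "b a"], wc_map, wc_reduce, group_width=1)


# Case 2: 3-sized tuples (user, timestamp, url), grouped by user and
# sorted by (user, timestamp), so the reducer sees visits in time order.
def visit_map(record):
    yield record


def first_visit_reduce(key, tuples):
    user, _, url = tuples[0]
    yield (user, url)


visits = [("bob", 5, "/b"), ("alice", 9, "/x"), ("bob", 2, "/a"), ("alice", 3, "/y")]
first_visits = tuple_mapreduce(
    visits, visit_map, first_visit_reduce, group_width=1, sort_width=2
)

print(word_counts)
print(first_visits)
```

In plain key/value MapReduce, the second case would typically need a composite key plus custom grouping and sorting comparators; the sketch only illustrates why declaring group-by and sort-by fields can shorten that code.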

Note that Pangool is still a MapReduce API, so it doesn't add any level of abstraction.

We designed Pangool with the aim of offering it as a replacement for the current MapReduce API. Therefore we are not labelling it as a "higher-level API" but as a comparable low-level API.

On the other hand we are also benchmarking Pangool to show it doesn't impose a performance overhead: http://pangool.net/benchmark.html


haberman 5220 days ago

It sounds like you are implementing an in-memory data structure (Tuple) and serialization of that data structure on top of the raw strings provided by the Hadoop API. While I can believe that the overall overhead of this would be small in many cases, you would observe it most severely in cases where your data was natively key/value pairs of very short strings, or where you had lots of tuples with very short payloads. Do any of your performance tests cover this case? I would expect Pangool to display more than negligible CPU and memory overhead in this case.
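The short-payload concern can be illustrated with some back-of-the-envelope arithmetic. The sketch below is an assumption-heavy toy, not Pangool's or Hadoop's actual wire format: `encode_raw` models a tab-separated text record, and `encode_tuple` models a hypothetical tuple encoding with a 4-byte field count and a 4-byte length prefix per field. It only compares encoded sizes, so it says nothing about CPU cost.

```python
import struct


def encode_raw(key, value):
    """Plain text record: key, tab, value, newline (2 bytes of framing)."""
    return key + b"\t" + value + b"\n"


def encode_tuple(fields):
    """Hypothetical tuple encoding, NOT Pangool's real serialization:
    a 4-byte field count, then each field behind a 4-byte length prefix."""
    out = struct.pack(">I", len(fields))
    for field in fields:
        out += struct.pack(">I", len(field)) + field
    return out


def overhead_ratio(key, value):
    """Size of the tuple encoding relative to the raw text record."""
    return len(encode_tuple([key, value])) / len(encode_raw(key, value))


# Very short key/value pair: fixed framing dominates the payload.
short_ratio = overhead_ratio(b"a", b"1")

# Larger payload: the same fixed framing becomes negligible.
long_ratio = overhead_ratio(b"k" * 100, b"v" * 1000)

print(f"short payload ratio: {short_ratio:.2f}")
print(f"long payload ratio:  {long_ratio:.3f}")
```

Under these made-up framing costs, the 2-byte payload takes 14 bytes instead of 4, while the 1100-byte payload grows by under 1%. A real format with variable-length integers would shrink the gap, which is why a benchmark on short records is the informative case.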

Also, since the data model is more complicated and provides more features, it takes more code and a more complex implementation. This could be significant if you were trying to port the model to another language or implementation, or were trying to formally prove things about the code or mathematical model, etc.

I'm not saying it's not cool; I actually think it's a good and powerful abstraction -- I just object to the characterization of "all features and no tradeoffs".


scott_s 5220 days ago

The tradeoff, then, is that if someone's current problem maps exactly to the current API, then your API is more complex than needed.


tim_h 5220 days ago

Pangool actually seems like a generalization of Hadoop. This doesn't necessarily make it more complex. If a problem maps exactly to the Hadoop API, then it should also map exactly to the Pangool API by setting m=2 (in the extended map reduce model described at http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...).


scott_s 5220 days ago

I agree with your first sentence, but disagree with the second. That you can find an exact mapping does not prevent the underlying API from being more complex than what you need. That you had to realize "Oh, m=2" is more complexity.

I'm not arguing this is a terrible thing. In fact, I think this is an acceptable level of additional complexity for the power it buys you. But if we're going to make an honest evaluation of the trade-offs, I think we must mention this.

It may be relevant to the discussion to point out that I work on a tuple-based streaming system. Product: http://www-01.ibm.com/software/data/infosphere/streams/ Academic: http://dl.acm.org/citation.cfm?id=1890754.1890761, http://dl.acm.org/citation.cfm?id=1645953.1646061
