	by ivanprado 5220 days ago
	Hi, I'm one of the developers of Pangool. The idea of Pangool is not to be yet another higher-level API on top of Hadoop but rather a replacement for the low-level Hadoop Java MapReduce API. Pangool has the same performance and flexibility as the Java MapReduce API, while making several things much easier and more convenient. There is no tradeoff, just advantages. There will be cases where you'd want to use Pig or Cascading, and other cases where you'd want the flexibility and efficiency of MapReduce. For those cases we conceived Pangool. Today, only very advanced Hadoop users can write efficient MapReduce jobs. Pangool hides the advanced boilerplate code needed to write highly efficient MapReduce jobs, making things like secondary sorting or reduce-side joins extremely easy.

2 comments

haberman 5220 days ago

> There is no tradeoff, just advantages.

Though I don't have deep expertise in Hadoop, I find this claim highly suspect. High-level APIs achieve user-friendliness by making decisions/assumptions about the way a lower-level API will be used. I would be very surprised if there was no use case for which your API does impose a trade-off vs. the low-level Hadoop API.

I feel much more confident using a high-level API if its author is up-front about what assumptions it's making. If the claim is that there is no trade-off vs. the low-level API, I generally conclude that the author doesn't understand the problem space well enough to know what those trade-offs are.

I could be wrong, but this is my bias/experience.

ferrerabertran 5220 days ago

Hi haberman, I'm one of the developers of Pangool. Let me try to clarify why we stated that. I understand it may sound aggressive.

Pangool is based on an extension of the MapReduce model we suggest and call "Tuple MapReduce". This is explained in detail in this post: http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...

What this means is that in Pangool, if you work with 2-sized Tuples, you can do exactly what you do now with Java MapReduce. That includes custom RawComparators and arbitrary business logic anywhere in the MapReduce chain (Mapper, Combiner, Reducer). Using n-sized Tuples together with Pangool's group & sort by and reduce-side join API only means less code and simpler code, at no loss of performance or flexibility.

Note that Pangool is still a MapReduce API, so it doesn't add any level of abstraction.

We designed Pangool with the aim of offering it as a replacement for the current MapReduce API. Therefore we are not labelling it as a "higher-level API" but as a comparable low-level API.

On the other hand we are also benchmarking Pangool to show it doesn't impose a performance overhead: http://pangool.net/benchmark.html

haberman 5220 days ago

It sounds like you are implementing an in-memory data structure (Tuple) and serialization of that data structure on top of the raw strings provided by the Hadoop API. While I can believe that the overall overhead of this would be small in many cases, you would observe it most severely in cases where your data was natively key/value pairs of very short strings, or where you had lots of tuples with very short payloads. Do any of your performance tests cover this case? I would expect Pangool to display more than negligible CPU and memory overhead in this case.

Also, since the data model is more complicated and provides more features, it takes more code and a more complex implementation. This could be significant if you were trying to port the model to another language or implementation, or were trying to reason formally about the code or mathematical model, etc.

I'm not saying it's not cool; I actually think it's a good and powerful abstraction -- I just object to the characterization of "all features and no tradeoffs".

scott_s 5220 days ago

The tradeoff, then, is that if someone's current problem maps exactly to the current API, then your API is more complex than needed.

tim_h 5220 days ago

Pangool actually seems like a generalization of Hadoop. This doesn't necessarily make it more complex. If a problem maps exactly to the Hadoop API, then it should also map exactly to the Pangool API by setting m=2 (in the extended map reduce model described at http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...).
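
To make the m=2 remark concrete, here is a minimal plain-Java sketch. It is not Pangool's API: representing a tuple as a `List<String>`, and the names `TupleGroupingSketch` and `groupBy`, are assumptions made only for illustration. Grouping n-sized tuples by their first `groupFields` fields reduces to ordinary key/value grouping when every tuple has two fields and `groupFields` is 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a tuple is modelled as List<String> (an assumption, not Pangool's Tuple).
class TupleGroupingSketch {

    // Group tuples by their first `groupFields` fields.
    // The remaining fields of each tuple are collected as that group's values.
    static Map<List<String>, List<List<String>>> groupBy(List<List<String>> tuples, int groupFields) {
        Map<List<String>, List<List<String>>> groups = new LinkedHashMap<>();
        for (List<String> tuple : tuples) {
            // Copy the sublists so keys and values do not share state with the input.
            List<String> key = new ArrayList<>(tuple.subList(0, groupFields));
            List<String> rest = new ArrayList<>(tuple.subList(groupFields, tuple.size()));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(rest);
        }
        return groups;
    }
}
```

With 2-sized tuples such as `["a", "1"]` and `groupFields = 1`, each group key is a single field and each value is a single field, which is the classic key/value shape. Larger tuples and larger `groupFields` use the same code path.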

scott_s 5220 days ago

I agree with your first sentence, but disagree with the second. That you can find an exact mapping does not prevent the underlying API from being more complex than what you need. That you had to realize "Oh, m=2" is more complexity.

I'm not arguing this is a terrible thing. In fact, I think this is an acceptable level of additional complexity for the power it buys you. But if we're going to make an honest evaluation of the trade-offs, I think we must mention this.

It may be relevant to the discussion to point out that I work on a tuple-based streaming system. Product: http://www-01.ibm.com/software/data/infosphere/streams/ Academic: http://dl.acm.org/citation.cfm?id=1890754.1890761, http://dl.acm.org/citation.cfm?id=1645953.1646061

avibryant 5220 days ago

Can you give an example of a job that would be difficult or impossible to perform efficiently with Cascading, but Pangool gives an advantage over raw MapReduce?

ivanprado 5220 days ago

Hi avibryant, according to our initial benchmark (http://pangool.net/benchmark.html), secondary sorting in Cascading is slow (http://bit.ly/wTKOxo), showing a 243% performance overhead compared to an efficient implementation in MapReduce. The MapReduce implementation takes a lot of lines (http://bit.ly/yYGnGe), whereas Pangool's implementation is quite simple (http://bit.ly/x9U7Yj). A common application of secondary sort is calculating moving averages, for instance.
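
As a rough illustration of why secondary sort helps with moving averages, here is a plain-Java sketch that runs in memory. It uses no Hadoop, Cascading, or Pangool classes, and the `Quote` record, `secondarySort`, and `movingAverages` names are hypothetical. The idea it shows: once records arrive grouped by one field and ordered by another, a single pass with a small sliding window is enough.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration of secondary sort; not a MapReduce job.
class MovingAverageSketch {

    // Hypothetical record: (symbol, day, price).
    static final class Quote {
        final String symbol;
        final int day;
        final double price;

        Quote(String symbol, int day, double price) {
            this.symbol = symbol;
            this.day = day;
            this.price = price;
        }
    }

    // Order by symbol (the group key) and then by day (the secondary key).
    static List<Quote> secondarySort(List<Quote> input) {
        List<Quote> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Quote q) -> q.symbol).thenComparingInt(q -> q.day));
        return sorted;
    }

    // One pass over the sorted records: per symbol, average the last `window` prices.
    static Map<String, List<Double>> movingAverages(List<Quote> input, int window) {
        Map<String, List<Double>> result = new LinkedHashMap<>();
        ArrayDeque<Double> recent = new ArrayDeque<>();
        double sum = 0.0;
        String current = null;
        for (Quote q : secondarySort(input)) {
            if (!q.symbol.equals(current)) {
                // A new group starts: reset the sliding window.
                current = q.symbol;
                recent.clear();
                sum = 0.0;
                result.put(current, new ArrayList<>());
            }
            recent.addLast(q.price);
            sum += q.price;
            if (recent.size() > window) {
                sum -= recent.removeFirst();
            }
            result.get(current).add(sum / recent.size());
        }
        return result;
    }
}
```

In a real job the framework's shuffle would do the sorting across machines, so the reducer would only contain the single-pass loop. Here the sort is done locally just to keep the sketch self-contained.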

avibryant 5220 days ago

Ok, so Cascading has a slow implementation of secondary sort, but is there any reason you believe that couldn't be improved? I don't think you're really comparing architectures there, just how well optimized particular implementations are.

I'm asking because in my experience the extra level of abstraction provided by Cascading, Crunch etc is a huge advantage, and if you're making a conscious choice to operate at a lower level, you better be getting something significant in return; it's not clear to me yet what that is.

ivanprado 5220 days ago

Pangool is not an alternative to Cascading. For example, at this point, Pangool does not help you manage workflows. If you are starting a MapReduce application, the best option is probably to start with higher-level abstractions: Cascading, Hive, Pig, etc.

But if you are thinking about learning Hadoop using the standard Hadoop API, or if you need to use it for your project for some particular reason, we recommend using Pangool instead.

Or if you are considering implementing another abstraction on top of Hadoop, using Pangool for it would probably also be a good idea.

In fact, what we believe is that the default Hadoop API should look like Pangool.

squarecog 5219 days ago

You are doing regex matching in the Cascading code, but splitting on a character in the Pangool code. The latter is obviously much faster. I don't know that that's the reason for the difference you observe, but it certainly can't hurt to fix that and make the user-supplied code more comparable.

ferrerabertran 5219 days ago

Indeed, that regex was problematic because it had a bug itself. We replaced that line with RegexSplitter and updated the benchmark page. Please shout if you notice anything else wrong. Thanks.

ivanprado 5219 days ago

Just to clarify, Java's split() function uses a regexp for the split as well. The code of String.split() is:

return Pattern.compile(regex).split(this, limit);

The benchmark seems fair to me.
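
The equivalence quoted above can be checked with a small sketch. The class and method names (`SplitEquivalence`, `viaString`, `viaPattern`, `sameResult`) are hypothetical; only `String.split` and `java.util.regex.Pattern` come from the JDK. It compares results only and says nothing about relative speed.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Compares String.split(regex) with an explicitly compiled Pattern.
class SplitEquivalence {

    // Split through the String convenience method.
    static String[] viaString(String line, String regex) {
        return line.split(regex);
    }

    // Split through a compiled Pattern, which a caller could reuse across many lines.
    static String[] viaPattern(String line, Pattern compiled) {
        return compiled.split(line);
    }

    // True when both routes produce the same fields for this input.
    static boolean sameResult(String line, String regex) {
        return Arrays.equals(viaString(line, regex), viaPattern(line, Pattern.compile(regex)));
    }
}
```

For example, `SplitEquivalence.sameResult("a\tb\tc", "\t")` compares the two routes on a tab-separated line. When timing matters, compiling the `Pattern` once outside the per-line loop is the usual way to keep the two sides of a benchmark comparable.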
