by plam 5242 days ago
Could you elaborate on #1 please? Wouldn't a distributed cache defeat the purpose of data locality of Hadoop? Regardless, I guess one could write a tap to Avout to enable this?

Sorry, just saw this reply. Hadoop comes with a distributed cache that is generally used for small files. A common example is a large join against a small table that fits in memory. For example, if you wanted to filter out stopwords, the currently accepted way is to put the stopword list into the resources/ directory of your JAR, which is not really optimal for data that might change frequently.
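
To make the small-table case concrete, here is a rough sketch in plain Python of the pattern the cache is meant for: each task reads the small file once into memory, then filters its share of the big input against it. It does not use Hadoop at all. The temp file stands in for a file the distributed cache would have copied to the task's node, and the names load_stopwords, filter_tokens and stopwords.txt are made up for illustration.

```python
import os
import tempfile


def load_stopwords(path):
    """Read the small side file once, as a task would at startup."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def filter_tokens(lines, stopwords):
    """Map-side filter: yield every token that is not in the in-memory set."""
    for line in lines:
        for token in line.split():
            if token.lower() not in stopwords:
                yield token


# Assumption: this temp file plays the role of a locally cached copy of the
# stopword list; in a real job the framework would place it on each node.
with tempfile.TemporaryDirectory() as tmp:
    stopword_path = os.path.join(tmp, "stopwords.txt")
    with open(stopword_path, "w", encoding="utf-8") as f:
        f.write("the\na\nof\n")
    stopwords = load_stopwords(stopword_path)

# Stand-in for the large input split handled by one task.
records = ["The cache of a cluster", "a small table"]
kept = list(filter_tokens(records, stopwords))
print(kept)
```

Because the list lives in a separate file rather than inside the JAR, changing it only means shipping a new file, not rebuilding and redeploying the job.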

http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/...

and for discussions related to Cascalog: http://groups.google.com/group/cascalog-user/browse_thread/t... and https://groups.google.com/forum/#!topic/cascalog-user/l5SEW3...

I have not seen any info on using Cascalog alongside Avout, but the idea makes sense.