by law 5330 days ago
Honestly, frameworks like Mahout and Weka have their place, and that's typically for exploratory data analysis. My belief is that for large-scale, extremely intensive machine learning, your best bet is to implement algorithms tailored to the job at hand. Algorithms like logistic regression work fine if your data is linearly separable, but they're not a panacea. None of the algorithms are.
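To make the linear-separability point concrete, here is a minimal sketch (not from the original comment) that fits a plain logistic regression by gradient descent on two toy datasets: AND, which a line can separate, and XOR, which it cannot. The function names, learning rate, and epoch count are all illustrative assumptions.

```python
import math

def train_logistic(points, labels, epochs=5000, lr=0.5):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for (x1, x2), y in zip(points, labels):
            p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
            err = p - y  # gradient of the log loss w.r.t. the linear score
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b

def accuracy(points, labels, w, b):
    """Fraction of points on the correct side of the learned line."""
    correct = 0
    for (x1, x2), y in zip(points, labels):
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        correct += int(pred == y)
    return correct / len(points)

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]  # linearly separable
xor_labels = [0, 1, 1, 0]  # not linearly separable

and_acc = accuracy(points, and_labels, *train_logistic(points, and_labels))
xor_acc = accuracy(points, xor_labels, *train_logistic(points, xor_labels))
print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

No single line separates the XOR points, so any linear model gets at most three of the four right, however long it trains. Fixing that takes different features or a different model, not more tuning.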

If you're interested in machine learning and artificial intelligence, I very strongly suggest you consider "enrolling" in Tom Mitchell's machine learning class at http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml -- the lectures are long and the mid-term and final are extremely difficult, but the material covered is an outstanding primer for these types of analyses.

After going through all of the lectures, you will look at things like Mahout and Weka as mere toys, and will be equipped to write your own implementations for whatever task you and your company are working on. It's a lot of front-loading for rewards that may at first glance seem illusory, but investing the time now will pay dividends later.

2 comments

Libraries like Weka and Mahout are no more toys than any other library that implements standard and widely applicable algorithms. Yes, you need to do a lot of extra work to properly model your problems, choose features, and combine different algorithms into a final product. But it's not often that you need to tweak the core algorithms that these libraries provide.

If you really understand enough to implement new classifiers or other types of learning algorithms, these libraries are still useful to you. For one, they provide a solid framework for allowing your new algorithm to easily interact with other algorithms. Two, it's not unlikely that your new algorithm is a variation on an existing one. Don't re-implement it. These libraries are open, so copy the source and modify it. And three, Mahout uses Hadoop. Distributed processing systems are another topic altogether. If you are proposing to write your own, I would hope that you have good reasons for spending the time. Hadoop is certainly no toy.

In summary, don't waste time reimplementing core algorithms unless you are doing it for a learning exercise. But do still take a good course on machine learning, because using the provided algorithms in these packages and others correctly is highly non-trivial.

Dunno about Weka, but my last experience (5 months back) with Mahout was not good. There are still quite a few bugs, and the fact that the entire code base is in Java makes it extremely unpleasant for someone who wants to jump right in and start tweaking stuff. However, in its defense, it is open source, is probably the only Hadoopified ML library out there, and has given me a ton of good ideas on how to write custom code.

Wow, we disagree. As much as I like to do my own development in dynamic languages like Clojure and JRuby, for me:

I would much rather have library and framework code that someone else has written, debugged, and supports to be written in Java: easy to browse in a good IDE, statically typed, lots of unit tests so you can hack away with some protection, etc.

Maybe my point wasn't clear enough: 1. I am comfortable with using someone else's library without having to reinvent the wheel but I want to know exactly what I am getting into without having to browse through tons of Java code. There are zillions of variants of algorithm X but I want to know exactly which implementation/variant Mahout uses without going through the source code. Unfortunately the docs (at least 4 months back) were pretty bad.

2. Their unit test coverage was not good enough, which incidentally is how I found that there were bugs. The problem in trying to contribute back to the community by trying to rectify these bugs? When I read the source code, I get the feeling that each algorithm is owned to a great extent by one developer who brings in their own idiosyncrasies, which means that you need to really study the code to make sure you don't accidentally add more bugs. The other disadvantage of this approach is that questions regarding potential bugs and puzzling issues can go unanswered, or get answered in an unsatisfactory manner (mainly because one developer wrote most of the code).

Having said all this, I want to be charitable and chalk these up to growing pains. But if I were building something critical and big dataish, I would use either Python (dumbo) or Scala, which are much more concise languages where it is easier to express math without introducing bugs.

You're correct to identify the point of libraries like Weka and Mahout, which are both written in Java, as providing a solid framework for interaction between and among your program and other algorithms. However, Java isn't the right solution for everyone. Moreover, in Weka's case, the GPL licensing may not comport with everyone's requirements. Mahout's license is more friendly to proprietary software, so it's admittedly a non-issue there.

I agree that Hadoop is certainly not a toy, but using Mahout on Hadoop clusters works better for analyzing large data sets that you've already collected and pre-processed. If you're doing any kind of active learning, or are designing software to run on a client's computer based on feedback that they provide, Mahout probably isn't the best choice.

In the end, it requires understanding your problem completely enough to justify your decision.

re: Weka GPL: one of my customers simply bought a commercial license. Easy.

There aren't very many statisticians/MLers who suggest (or practice) reimplementing your own algorithms, except for quite simple things, because the risk of getting something wrong is pretty high, and the work to make things efficient is non-trivial. If anything, the current push is in the other direction, towards encouraging more people to share their code, and more people to use well-tested code, through initiatives like http://jmlr.csail.mit.edu/mloss/ , http://www.jstatsoft.org/ , and CRAN.

For example, you could reimplement your own SVM instead of using http://svmlight.joachims.org/ , but your chance of producing something correct and as efficient is pretty low...

I think the choice between using existing libraries and implementing your own mostly depends on how central particular algorithms are to your product. If a better algorithm makes a great difference for my customers then it's insane for me to use an existing library.

I don't even find much value in looking at existing code as a starting point because it's bound to be either obscured by lots of optimizations or naive or it's university code left behind by someone finishing their thesis in a hurry. For code beyond a certain level of complexity I prefer to either use it as a black box or implement it myself.

Obviously, if the algorithm is not a core component of my product it's insane to waste time on reimplementing it, provided there is a good quality implementation that has the right license.

It's a fine line to walk. On the one hand, community-vetted code is a spectacular idea for the core algorithms, but on the other, overly-restrictive licenses (like [L]GPL) effectively preclude the maximum utility being derived from them.

OTOH, if you do need to write your own code, you could use the existing test suites to make it a lot easier. I think the GPL would allow this.
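A rough sketch of what leaning on a trusted implementation while writing your own might look like. Here Python's standard `statistics.pvariance` stands in for the vetted library, and `online_variance` is a hypothetical home-grown routine; neither comes from Weka, Mahout, or their test suites, and the tolerances are assumptions.

```python
import math
import random
import statistics

def online_variance(values):
    """Welford's single-pass population variance (the home-grown code under test)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

def check_against_reference(trials=200, seed=0):
    """Compare the home-grown routine with the reference on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        data = [rng.uniform(-1000, 1000) for _ in range(rng.randint(1, 50))]
        expected = statistics.pvariance(data)  # trusted reference
        got = online_variance(data)
        assert math.isclose(got, expected, rel_tol=1e-9, abs_tol=1e-6), (data, got, expected)
    return trials

checked = check_against_reference()
print("cases checked:", checked)
```

The same pattern works with any reference you trust: feed both implementations the same inputs and fail loudly on the first disagreement.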