Hacker News
by dmk23 5330 days ago
Mahout is a great platform, but the real challenge is defining your learning problems, preparing data sets, and choosing the right algorithms.

Once you are clear as to what you actually want to accomplish, chances are you are going to need some kind of significantly modified or hybrid algorithm. Packages like Mahout can help you get started, but it is kinda funny that quite a few examples in this article do not actually demonstrate good algorithm performance, like this one:

  Correctly Classified Instances : 41523 61.9219%
  Incorrectly Classified Instances : 25534 38.0781%
  Total Classified Instances : 67057
  =======================================================
  Confusion Matrix
  -------------------------------------------------------
  a b c d e f <--Classified as
  190440 12 1069 0 0 | 20125 a= cocoon_apache_org_dev
  2066 0 1 477 0 0 | 2544 b= cocoon_apache_org_docs
  165480 2370 704 0 0 | 19622 c= cocoon_apache_org_users
  58 0 0 201090 0 | 20167 d= commons_apache_org_dev
  147 0 1 4451 0 0 | 4599 e= commons_apache_org_user
1 comments

There are decimal dots missing in the confusion matrix numbers (e.g., 190440 should read 19044.0, in case anyone else was wondering why the numbers don't add up).
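
A quick way to check that reading, sketched in Python. The assumption here (not confirmed by the article) is that each fused number is a count followed by a 0 cell, so "190440" is taken as 19044 and 0. Under that assumption, each row should sum to the total printed after the "|", and the diagonal should give the "Correctly Classified Instances" figure.

```python
# Confusion-matrix rows from the quoted output, with each fused number split
# into a count plus a following 0 cell (e.g. "190440" -> 19044, 0).
# That split is an assumed reading of the garbled text.
matrix = [
    # cells for columns a..f,          row total as printed
    ([19044, 0, 12,   1069,  0, 0], 20125),  # a = cocoon_apache_org_dev
    ([2066,  0, 1,    477,   0, 0], 2544),   # b = cocoon_apache_org_docs
    ([16548, 0, 2370, 704,   0, 0], 19622),  # c = cocoon_apache_org_users
    ([58,    0, 0,    20109, 0, 0], 20167),  # d = commons_apache_org_dev
    ([147,   0, 1,    4451,  0, 0], 4599),   # e = commons_apache_org_user
]

# Every row's cells should sum to the printed row total.
rows_add_up = all(sum(cells) == printed for cells, printed in matrix)

# Total instances, and the diagonal (row i, column i) as correct predictions.
total = sum(printed for _, printed in matrix)
correct = sum(cells[i] for i, (cells, _) in enumerate(matrix))
accuracy_pct = round(100 * correct / total, 4)

print(rows_add_up, total, correct, accuracy_pct)
```

With the split applied, the rows add up, and the total and diagonal agree with the 67057 and 41523 in the summary lines, which supports the idea that the matrix was only mangled in formatting.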

If anything, the article convinced me not to use Mahout. So, the author decided to use the simplest algorithm, Naive Bayes, and got miserable results (from the article: "This is possibly due to a bug in Mahout that the community is still investigating."). He then changed the problem formulation in order to get better results, and concluded by saying the outcome is still likely a bug, but he's happy with it anyway?

This would probably be fine if we were talking about a small, nimble project that you could go into and hack/fix yourself. But we're talking about a massive Java codebase. The thought of customizing it makes me shudder.

EDIT: forgot to mention I agree with the parent comment completely, except I would add "... and choosing the right evaluation process" to the initial sentence.