| HN Mirror

Correctly Classified Instances : 41523 61.9219% Incorrectly Classified Instances : 25534 38.0781% Total Classified Instances : 67057 ======================================================= Confusion Matrix ------------------------------------------------------- a b c d e f ><--Classified as 190440 12 1069 0 0 | 20125 a= cocoon_apache_org_dev 2066 0 1 477 0 0 | 2544 b= cocoon_apache_org_docs 165480 2370 704 0 0 | 19622 c= cocoon_apache_org_users 58 0 0 201090 0 | 20167 d= commons_apache_org_dev 147 0 1 4451 0 0 | 4599 e= commons_apache_org_user

There are decimal dots missing in the confusion matrix numbers (i.e., 190440 should read 19044.0, in case anyone else was wondering why the numbers don't add up).

If anything, the article convinced me not to use Mahout. So, the author decided to use the simplest algorithm, Naive Bayes, and got miserable results (from the article: "This is possibly due to a bug in Mahout that the community is still investigating."). He then changed to problem formulation in order to get better results, and concluded by saying the outcome is still likely a bug, but he's happy with it anyway?

This would be probably fine if we were talking about a small, nimble project that you could go into and hack/fix yourself. But we're talking about a massive, Java codebase. The thought of customizing it makes me shudder.

EDIT: forgot to mention I agree with the parent comment completely, except I would add "... and choosing the right evaluation process" to the initial sentence.