by law
5330 days ago
Honestly, frameworks like Mahout and Weka have their place, and that's typically exploratory data analysis. My belief is that for large-scale, extremely intensive machine learning, your best bet is to implement algorithms tailored to the job at hand. Algorithms like logistic regression work fine if your data is linearly separable, but they're not a panacea. None of the algorithms are. If you're interested in machine learning and artificial intelligence, I very strongly suggest "enrolling" in Tom Mitchell's machine learning class at http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml -- the lectures are long and the mid-term and final are extremely difficult, but the material covered is an outstanding primer for these types of analyses. After going through all of the lectures, you will look at things like Mahout and Weka as mere toys, and will be equipped to write your own implementations for whatever task you and your company are working on. It's a lot of front-loading for rewards that may at first glance seem illusory, but investing the time now will pay dividends later.
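To make the linear-separability point concrete, here is a rough sketch (mine, not taken from Mahout, Weka, or the course). It trains a bare-bones logistic regression with batch gradient descent on two toy datasets: AND, which a straight line can separate, and XOR, which it cannot. The function names, learning rate, and epoch count are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(points, labels, lr=0.5, epochs=5000):
    # Plain batch gradient descent on the mean log-loss, two features plus a bias.
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for (x1, x2), y in zip(points, labels):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b


def accuracy(w, b, points, labels):
    # Fraction of points whose predicted class (threshold 0.5) matches the label.
    correct = 0
    for (x1, x2), y in zip(points, labels):
        predicted = 1 if sigmoid(w[0] * x1 + w[1] * x2 + b) >= 0.5 else 0
        correct += int(predicted == y)
    return correct / len(points)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]  # linearly separable
xor_labels = [0, 1, 1, 0]  # not linearly separable

and_acc = accuracy(*train_logistic(points, and_labels), points, and_labels)
xor_acc = accuracy(*train_logistic(points, xor_labels), points, xor_labels)

print("AND accuracy:", and_acc)
print("XOR accuracy:", xor_acc)
```

The model fits AND perfectly, but on XOR it can never get more than 3 of the 4 points right, however long you train, because no single line splits the two classes. That's the kind of mismatch between algorithm and data I mean.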
If you really understand enough to implement new classifiers or other types of learning algorithms, these libraries are still useful to you. For one, they provide a solid framework that lets your new algorithm interact easily with other algorithms. Two, it's not unlikely that your new algorithm is a variation on an existing one. Don't re-implement it from scratch. These libraries are open source, so copy the source and modify it. And three, Mahout uses Hadoop. Distributed processing systems are another topic altogether. If you are proposing to write your own, I would hope that you have good reasons for spending the time. Hadoop is certainly no toy.
In summary, don't waste time reimplementing core algorithms unless you are doing it for a learning exercise. But do still take a good course on machine learning, because using the provided algorithms in these packages and others correctly is highly non-trivial.