by osipov 6635 days ago
I think you are looking for document classification algorithms: http://en.wikipedia.org/wiki/Document_classification

The current state-of-the-art algorithms are based on support vector machines, but their training step can be tricky to implement in a scalable fashion. If you are looking for a quick and dirty approach, TF-IDF (think of it as a naive "naive Bayes" :) is simple and adequate for many applications.
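To make the quick and dirty route concrete, here is a minimal sketch of one common way to turn TF-IDF weights into a classifier: sum the TF-IDF vectors of each class into a centroid and pick the class whose centroid has the highest cosine similarity to the new document. The nearest-centroid step, the smoothed IDF formula, the whitespace tokenizer, and all function names are assumptions made for illustration, not something specified above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for a toy example.
    return text.lower().split()


def tfidf(tokens, idf):
    # Term frequency (relative count) times inverse document frequency.
    # Terms never seen during training are ignored.
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def train(labeled_docs):
    """labeled_docs is a list of (label, text) pairs. Returns (idf, centroids)."""
    tokenized = [(label, tokenize(text)) for label, text in labeled_docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for _, tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF, so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + n)) + 1 for t, n in df.items()}

    # One centroid per class: the sum of the TF-IDF vectors of its documents.
    # Cosine similarity ignores vector length, so summing works like averaging.
    centroids = defaultdict(Counter)
    for label, tokens in tokenized:
        for term, weight in tfidf(tokens, idf).items():
            centroids[label][term] += weight
    return idf, dict(centroids)


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, idf, centroids):
    vec = tfidf(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical training data, purely for illustration.
TRAINING = [
    ("sports", "the team won the football match"),
    ("sports", "a great goal in the final match"),
    ("tech", "the new compiler speeds up the code"),
    ("tech", "python code for a machine learning library"),
]

idf, centroids = train(TRAINING)
print(classify("football goal", idf, centroids))
print(classify("compiler library", idf, centroids))
```

Everything here is standard library only, which keeps it easy to port. For real data you would probably want a better tokenizer and a stop-word list before anything else.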


Forgot to mention: you may want to look at the Orange framework, which is in Python: http://www.ailab.si/orange/
Great! That does look interesting. I'll have a good look through it. Are you involved in this field yourself? If so, drop me a line, as we're looking for a consultant to help us with something.
I've done some work on machine learning, and specifically document classification, at a corporate behemoth. Send me an email to gmail.com and prefix that with osipov followed by an @ sign -- I'll get back to you.
This sounds interesting to me too, and in an area where I've had some experience. If nothing else perhaps we can trade notes. You'll find contact details in my profile.
I think every language has an open source naive Bayes/Bayesian network implementation. And most of the time they are good enough.
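For a sense of how little code a basic naive Bayes text classifier needs, here is a minimal sketch of the multinomial variant with add-one (Laplace) smoothing. The class name, the `alpha` parameter, the whitespace tokenizer, and the toy training data are all assumptions for illustration; an existing open source implementation will be more robust.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes over whitespace-separated tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumed default)
        self.class_counts = Counter()           # documents seen per class
        self.word_counts = defaultdict(Counter) # token counts per class
        self.vocab = set()

    def fit(self, labeled_docs):
        for label, text in labeled_docs:
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        # Tokens never seen in training carry no information, so skip them.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            # log P(class) + sum of log P(token | class), smoothed.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical training data, purely for illustration.
model = NaiveBayes().fit([
    ("spam", "win cash prize now"),
    ("spam", "cheap prize win win"),
    ("ham", "meeting notes for the project"),
    ("ham", "lunch meeting at noon"),
])
print(model.predict("win a prize"))
print(model.predict("project meeting"))
```

Working in log space avoids numeric underflow when documents get long, and the smoothing keeps a single unseen word from zeroing out a class.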

SVM (support vector machines) so far is considered the best classification algorithm.

>SVM (support vector machines) so far is considered the best classification algorithm.

In the spirit of accuracy, SVM algorithms aren't _the best_. The best algorithms are ensemble-based, incorporating SVM and alternatives.
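As a rough illustration of the ensemble idea, here is a minimal majority-vote sketch. The `majority_vote` and `keyword_rule` helpers are hypothetical, and the keyword rules are only stand-ins for real trained models such as an SVM or a naive Bayes classifier; the point is just the combining step.

```python
from collections import Counter


def majority_vote(classifiers, text):
    # Each classifier is any callable that maps a text to a label.
    votes = Counter(clf(text) for clf in classifiers)
    # On a tie, the label that was voted for first wins.
    return votes.most_common(1)[0][0]


def keyword_rule(keyword, label, default):
    # Toy stand-in for a trained model: answers `label` if `keyword` appears.
    def classify(text):
        return label if keyword in text.lower().split() else default
    return classify


# Hypothetical members of the ensemble, purely for illustration.
ensemble = [
    keyword_rule("goal", "sports", "tech"),
    keyword_rule("match", "sports", "tech"),
    keyword_rule("code", "tech", "sports"),
]

print(majority_vote(ensemble, "late goal wins the match"))
print(majority_vote(ensemble, "the code did not match"))
```

Plain voting is the simplest way to combine models; weighting each vote by the model's confidence or held-out accuracy is a common refinement.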