HN Mirror



	by osipov 6682 days ago
	I think you are looking for document classification algorithms (http://en.wikipedia.org/wiki/Document_classification). The current state-of-the-art algorithms are based on support vector machines, but their learning part can be tricky to implement in a scalable fashion. If you are looking for a quick and dirty approach, the TF-IDF algorithm (it is a naive "naive Bayes" :) ) is simple and adequate for many applications.
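For anyone who wants to see what the quick and dirty TF-IDF route might look like, here is a minimal sketch in plain Python. It is only one way to do it: the whitespace tokenizer, the smoothed IDF formula, the per-class summed vectors, and the tiny `TRAIN` set are all assumptions made up for illustration, not something from the post above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption; real text needs more care).
    return text.lower().split()


def tfidf(text, idf):
    # Term frequency times inverse document frequency; unseen terms are dropped.
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train(docs):
    """docs is a list of (text, label) pairs. Returns (idf, centroids)."""
    n = len(docs)
    df = Counter()
    for text, _ in docs:
        df.update(set(tokenize(text)))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    # One summed TF-IDF vector per class (cosine similarity ignores the scale).
    centroids = defaultdict(Counter)
    for text, label in docs:
        for t, w in tfidf(text, idf).items():
            centroids[label][t] += w
    return idf, dict(centroids)


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(text, idf, centroids):
    # Pick the class whose summed vector is closest in cosine similarity.
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical toy training data.
TRAIN = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]

idf, centroids = train(TRAIN)
print(classify("buy cheap pills", idf, centroids))  # prints "spam"
```

With real data you would want proper tokenization and a held-out test set before trusting any of this.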

2 comments

osipov 6682 days ago

forgot to mention that you may want to look at the Orange framework which is in Python http://www.ailab.si/orange/


groovyone 6682 days ago

great! that does look interesting. I'll have a good look through it. You involved in this field yourself? If so, drop me a line as we're looking for a consultant to help us with something


osipov 6682 days ago

I've done some work on machine learning and specifically document classification in a corporate behemoth. Send me an email to gmail.com and prefix that with osipov followed by an @ sign -- I'll get back to you.


wehriam 6682 days ago

This sounds interesting to me too, and in an area where I've had some experience. If nothing else perhaps we can trade notes. You'll find contact details in my profile.


yawl 6682 days ago

I think every language has an open source naive Bayes/Bayes network implementation. And most of the time they are good enough.

SVM (support vector machines) so far is considered the best classification algorithm.
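To give a feel for how small a "good enough" naive Bayes text classifier can be, here is a rough sketch in plain Python. It assumes a multinomial model with add-one (Laplace) smoothing and whitespace tokenization; the `NaiveBayes` class, its method names, and the `DATA` set are hypothetical and not taken from any particular library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with additive smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def fit(self, docs):
        for text, label in docs:
            words = text.lower().split()
            self.class_docs[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def log_scores(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training data.
DATA = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]

model = NaiveBayes().fit(DATA)
print(model.predict("cheap pills"))  # prints "spam"
```

Working in log space avoids underflow when documents get long, which is the main practical trick here.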


osipov 6682 days ago

>SVM (support vector machines) so far is considered the best classification algorithm.

In the spirit of accuracy, SVM algorithms aren't _the best_. The best algorithms are ensemble-based, incorporating SVM and alternatives.
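As a rough illustration of the ensemble idea, here is a sketch of the simplest combiner, a majority vote. The three member classifiers below are hypothetical rule-based stand-ins so the example runs on its own; in practice the members would be trained models such as an SVM, naive Bayes, and so on, and real ensembles often use weighted votes or stacking instead.

```python
from collections import Counter


def majority_vote(classifiers, text):
    # Each classifier is any callable mapping text to a label.
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-in members; replace with trained models in practice.
def keyword_rule(text):
    return "spam" if "buy" in text.lower().split() else "ham"


def length_rule(text):
    return "spam" if len(text.split()) <= 3 else "ham"


def shouting_rule(text):
    return "spam" if text.isupper() else "ham"


MEMBERS = [keyword_rule, length_rule, shouting_rule]

print(majority_vote(MEMBERS, "buy cheap pills"))  # prints "spam"
```

The vote only helps when the members make different kinds of mistakes, which is why mixing SVMs with other model families tends to pay off.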
