by joaomsa 4513 days ago
Very nice. How do you determine the categories?

Using natural language processing and supervised learning, it automatically tags news items using 9 categories.
Should http://news.ycombinator.com/item?id=7183076 be categorized as Web Technology then?
I wonder if you clustered HN articles to find categories automatically. That may not be useful, but I'd be interested in seeing how HN articles are naturally categorized.
That's really interesting. Will you open-source the NLP algorithm?
I am planning to open-source it in several months. (Our code is not yet well commented or well structured...)

The details of our implementation and algorithm are as follows.

The categorization process is written in Python.

Using nltk, it builds a corpus with a TF-IDF model from HN topics and comments. It then trains classifiers on this corpus with the SVM algorithm, using scipy and numpy.
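To make the TF-IDF plus SVM idea concrete, here is a minimal sketch in plain Python. It is not the actual implementation (which uses nltk, scipy and numpy and has not been released). The tokenizer, the toy training titles, the "Other" label, and the Pegasos-style training loop are all assumptions chosen for illustration, and it only separates one category from the rest, where a real system would need one classifier per category.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    # Simple stand-in for an nltk tokenizer: lowercase alphanumeric runs.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def fit_idf(docs):
    # Smoothed inverse document frequency for every term in the corpus.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    # Sparse, L2-normalized TF-IDF vector; terms unseen in training are dropped.
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def train_linear_svm(vectors, labels, epochs=200, lam=0.01, seed=0):
    # Linear SVM (hinge loss, L2 penalty) trained by stochastic subgradient
    # descent in the style of Pegasos. Labels are +1 or -1. No bias term.
    rng = random.Random(seed)
    w = {}
    t = 0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            x, y = vectors[i], labels[i]
            margin = y * sum(w.get(k, 0.0) * v for k, v in x.items())
            scale = 1.0 - eta * lam  # shrinkage from the L2 penalty
            for k in w:
                w[k] *= scale
            if margin < 1.0:  # hinge loss is active for this example
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + eta * y * v
    return w


# Hypothetical hand-annotated titles: +1 = "Web Technology", -1 = anything else.
TRAIN = [
    ("javascript css framework released", +1),
    ("javascript browser rendering css", +1),
    ("html javascript frontend tips", +1),
    ("kernel scheduler patch merged", -1),
    ("kernel memory allocator benchmark", -1),
    ("linux kernel filesystem driver", -1),
]

docs = [d for d, _ in TRAIN]
labels = [y for _, y in TRAIN]
idf = fit_idf(docs)
weights = train_linear_svm([tfidf(d, idf) for d in docs], labels)


def predict(text):
    # Sign of the linear decision function picks the category.
    score = sum(weights.get(t, 0.0) * v for t, v in tfidf(text, idf).items())
    return "Web Technology" if score > 0 else "Other"
```

Calling predict("new css tricks for javascript") scores the title against the learned weights. Covering all 9 categories would mean training one such classifier per category and picking the highest score.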

FYI, its web interface is written in Clojure and ClojureScript.

Presumably you've trained it with hand-annotated content, or bootstrapped from a few choice HN searches (e.g. ?q=jquery will give you a web tech category)?
Yes. You are right.

I trained the classifiers on hand-annotated data (about 1000 items or so).
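For comparison, the keyword-bootstrapping idea raised in the question could be sketched roughly like this. It is an illustration only, not what was actually done (the classifiers were trained on hand annotations). The keyword lists and the "Security" category are hypothetical; only the jquery to web tech pairing comes from the comment above.

```python
import re

# Hypothetical seed keywords per category, used to produce weak labels.
SEED_KEYWORDS = {
    "Web Technology": {"jquery", "javascript", "css"},
    "Security": {"vulnerability", "exploit", "encryption"},
}


def bootstrap_label(title):
    # Return the category with the most keyword hits, or None if nothing matches.
    tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    hits = {cat: len(tokens & kws) for cat, kws in SEED_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None
```

Titles that come back as None would stay unlabeled, and the weak labels would still need spot checks by hand before being used for training.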