by a_macgregor 4892 days ago
Stavrosk,

If all you want is to classify the front-page stories into a preset list of categories, that's actually pretty simple to do.

I've been working on a similar concept for a personal project. Here are my recommendations:

- Be sure to remove stopwords from the titles before using the classifier.
- The ankusa gem will help you greatly: https://github.com/bmuller/ankusa
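
A rough sketch of the stopword step, written in Python only to keep it dependency-free. The stopword list here is a tiny made-up sample and `tokenize` is a hypothetical helper name, not part of ankusa:

```python
import re

# A tiny illustrative stopword list (an assumption); a real list would be much longer.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "is", "on", "with", "how", "why"}

def tokenize(title):
    """Lowercase a title, split it into word tokens, and drop stopwords."""
    words = re.findall(r"[a-z0-9+#]+", title.lower())
    return [w for w in words if w not in STOPWORDS]

print(tokenize("How to Build a Text Classifier in Ruby"))
```

The remaining words are what you would feed to the classifier instead of the raw title.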

Ankusa is a naive Bayes text classifier that will come in really handy for the task you are trying to achieve.
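
To give an idea of what a naive Bayes classifier does under the hood, here is a minimal sketch in plain Python. It is not ankusa's code or API; the `NaiveBayes` class, its method names, and the toy training titles are all made up for illustration, and it uses add-one smoothing so unseen words don't zero out a category:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial naive Bayes over lists of words (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training titles
        self.vocab = set()                       # every word seen in training

    def train(self, label, words):
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, words):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: how common this label is in the training data.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                # Add-one (Laplace) smoothed word likelihood.
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data (hypothetical titles, already tokenized).
nb = NaiveBayes()
nb.train("ruby", ["rails", "gem", "release"])
nb.train("ruby", ["ruby", "gem", "performance"])
nb.train("php", ["magento", "php", "extension"])
nb.train("php", ["php", "framework", "release"])

print(nb.classify(["new", "ruby", "gem"]))
```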

Also make sure your training data sets are clean and overlap as little as possible.

Finally, have fun and let us know how it goes!

Cheers and let me know if you have more questions or if you want a hand coding this thing.


Thanks for your answer! What I'm thinking of making basically separates posts into two categories: things that interest me and things that don't. Then I want to receive emails at intervals I specify. That way I no longer have the urge to check HN frequently, but still stay up to date.

The actual classification is probably the easy part, the hard part is training the model, which is why I wanted to ask if anyone had done it before. Have you managed to train anything to recognize your tastes, or is it objective categories? How well does it work?

Well, my classifier works based on categories like ruby, programming, php, magento, etc.

To train the classifier I grabbed feeds from different subreddits and used that as a base data set. What you are trying to achieve sounds more like a recommendation engine than a classifier, so recommendify might come in handy: https://github.com/paulasmuth/recommendify

You can still use the Bayesian classifier. For training it I would recommend the supervised route: start with a small data set (about 100 records) and manually classify each of the training examples.

You should also leave some way to give feedback to your classifier, so you can correct its mistakes and improve the results over time.
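
One possible shape for that feedback step, as a sketch only: keep the latest vote per title so a later correction replaces an earlier mistake, then retrain from the stored votes. The function names and the "interesting"/"boring" labels are assumptions for illustration:

```python
def record_vote(votes, title, liked):
    """Remember the latest vote for a title; a later vote replaces an earlier one."""
    votes[title] = "interesting" if liked else "boring"
    return votes

def training_examples(votes):
    """Turn the stored votes into (label, title) pairs for retraining."""
    return [(label, title) for title, label in votes.items()]

votes = {}
record_vote(votes, "Show HN: A naive Bayes classifier in 50 lines", liked=True)
record_vote(votes, "Celebrity gossip roundup", liked=True)   # accidental upvote
record_vote(votes, "Celebrity gossip roundup", liked=False)  # correction overrides it

for label, title in training_examples(votes):
    print(label, "<-", title)
```

Because each title keeps only its most recent label, retraining from `training_examples(votes)` never sees the accidental vote.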

Yeah, I'll have upvotes and downvotes to tell it what I liked or didn't. Unfortunately, I can't see a way to do this without supervised learning (maybe semi-supervised would work), which is why I posted here for ideas (I want to avoid the costly supervision step if someone knows the result won't work).

Thanks for your comments, they help a lot.

I suppose you could train the classifier by having it record what you upvote, or which links you click on. Perhaps a Firefox/Chrome extension could do that?

Some people at Reddit were programming a recommender about a year ago: http://www.reddit.com/r/redditdev/comments/lowwf/attempt_2_w... It doesn't use a Naive Bayesian Classifier but it might still interest you.

I'm currently using a very simple bookmarklet scheme: one bookmarklet for upvotes and one for downvotes. It works very well for collecting data. I'll train the classifier later tonight, I think.

Thank you for the link. It looks very extensive; I'll peruse it later on.