Hacker News new | ask | show | jobs
by jmillerinc 5807 days ago
You can implement what you want with simple keyword & url filtering.
3 comments

No, I'm sorry, but I can't use keyword filtering for what I'm describing. Let me explain:

What I'm talking about here is uncovering "latent" communities, if you will. As in, make a giant matrix with the users being the columns and the posts being the rows and then use the eigenvectors to make recommendations (see SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition)

The benefit of this approach is that I no longer have to be conscious of the topics I am filtering in or out. Even keyword based filtering is, again, a coarse estimation of relevance. I may be very interested in clojure, but I'm certainly not interested in every article that contains 'clojure' in the title.

An SVD (or similar) approach would filter my interests loosely on the co-occurrence of votes. That is, a vote from someone with whom I have high overlap is worth more to me than a vote from someone with whom I have never voted the same direction on the same post.

I question whether SVD would yield good recommendations.

In any case, co-voting data is not scrape-able from the public HN site, so I think using keywords and urls is really the only realistic filtering option at this point.

You can use people's comments as a (loose) proxy for their interest in a post; people who comment on something are more likely to have upvoted it (or at least consider it worthwhile to talk about, even if they never really vote on things.) You could perhaps even use Sentiment analysis, and take negative (root-level) comments as downvotes (and prune any branch below a negative comment, because it's probably an argument.)
Speaking of this, does anyone know of a good RSS filter? By that I mean a service to which you give a link to an RSS feed and provide certain filters, and they will provide a link to a modified version of the feed that they host themselves.
Yahoo Pipes is the closest thing I know to that.

http://pipes.yahoo.com/pipes/

Here's a screenshot of one of mine http://imgur.com/NLOkM.png

Awesome, this is exactly what I wanted! By filtering out all the Apple/iPhone/iPad-related crap, I'll probably be able to cut the number of new items from tech blogs in half.
And if it's a Gawker Media site, they allow you to filter their by changing the URI. For example, http://gizmodo.com/tag/not:apple/not:iphone/not:ipad/index.x...

I wish more sites had that kind of filtering.

Oh, that's even better. I don't subscribe to Gizmodo, but I do subscribe to Lifehacker.
Everyone forgets about postrank.com (formerly aiderss.com) It is actually what I use to filter HN... but who knows, I may switch over to the HN50 or HN100 described above.
I don't think that achieves the same thing. First, it requires the user to manually filter out new stuff they aren't interested in, where as a machine learning approach will evolve as the content space evolves.

Think of it like email spam. You can setup manual filters to filter out email spam, but that is a constant and never ending stream of work for you. A simple bayesian filter like pg has described will require far less work and give far better results.

In this case, a machine learning approach is even better because it can bring up stories that a user will be very interested in even though the story would never make it to the current homepage.