Hacker News
by PaulHoule 813 days ago
That’s interesting.

I have predictive models that take just a headline (without the rest of the article, and not considering the URL) and predict (a) whether it will get more than 10 votes and (b) if it does get more than 10 votes, whether the votes/comments ratio will be more than 2 (which is roughly average).
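To make the two targets concrete, here is a minimal sketch of how the labels could be derived from raw vote and comment counts. The function name `make_labels` and the handling of zero-comment submissions are assumptions for illustration, not something stated in the comment; only the thresholds (10 votes, ratio of 2) come from the text.

```python
from typing import Optional, Tuple

VOTE_THRESHOLD = 10     # "more than 10 votes", from the comment
RATIO_THRESHOLD = 2.0   # "votes/comments ratio more than 2", from the comment


def make_labels(votes: int, comments: int) -> Tuple[int, Optional[int]]:
    """Return (label_a, label_b) for one submission.

    label_a: 1 if the submission got more than VOTE_THRESHOLD votes, else 0.
    label_b: only defined when label_a == 1; 1 if votes/comments exceeds
             RATIO_THRESHOLD, else 0. It is None when label_a == 0.
    """
    label_a = 1 if votes > VOTE_THRESHOLD else 0
    if label_a == 0:
        return label_a, None
    # Assumption: a submission with zero comments counts as "high ratio".
    if comments == 0:
        return label_a, 1
    label_b = 1 if votes / comments > RATIO_THRESHOLD else 0
    return label_a, label_b
```

The second label is left undefined for low-vote submissions because model (b) is described as conditional on the headline clearing the 10-vote bar.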

The first model gets a ROC-AUC (see https://scikit-learn.org/stable/modules/generated/sklearn.me...) in the low 60’s (not good), the second model gets in the low 70’s (actually pretty good, though it is a heat-seeking missile for clickbait headlines), and my latest content-based recommender for RSS items gets almost 80. (I saw a paper saying that one system at TikTok gets about 85.)
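For anyone unfamiliar with the metric: ROC-AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. The numbers above read as that value times 100, so "low 60’s" would be roughly 0.60 to 0.65 (that reading is my assumption). A minimal pure-Python sketch of the definition, with a hypothetical function name `roc_auc`; it is quadratic in the number of examples, so it is for illustration rather than for large datasets:

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A pair counts 1.0 if the positive scores higher, 0.5 on a tie, else 0.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 3 of the 4 (positive, negative) pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

A score of 0.5 means the model ranks no better than chance and 1.0 means a perfect ranking.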

To do all that you need about 10,000 headlines and don’t get a lot of benefit from having more than 100,000. The ceilings on performance have more to do with the nature of the problem than with my models: the same article can get submitted twice and get 0 votes one time and 200 the other time, so it can never be as accurate as “is this an article about galactic astronomy?”

I had it ingest the HN comments firehose and found the number of articles was overwhelming; my YOShInOn RSS reader now ingests the “best comments” from

https://hnrss.github.io/

together with 110 other feeds and actually I like the comments it picks out a lot. Now that the system is adding about 3000 items per day it might be able to handle a big feed like the comments firehose since now those comments are diluted with so many quality articles. For a problem like that you might want a two-score system with: (i) is it relevant? (something I like) and (ii) is it popular? (like Google’s PageRank)
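One way the two-score idea could be sketched is below. Everything here is hypothetical: the `Item` fields, the 0.7 weight, and the choice of a weighted geometric mean are illustrative assumptions, not how YOShInOn or PageRank actually work. The geometric mean is used so that an item needs a decent value on both scores to rank high.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    title: str
    relevance: float    # hypothetical: P(I like this), in [0, 1]
    popularity: float   # hypothetical: P(this is popular), in [0, 1]


def combined_score(item: Item, relevance_weight: float = 0.7) -> float:
    """Weighted geometric mean of relevance and popularity."""
    w = relevance_weight
    return (item.relevance ** w) * (item.popularity ** (1.0 - w))


def rank(items: List[Item], relevance_weight: float = 0.7) -> List[Item]:
    """Sort items from highest to lowest combined score."""
    return sorted(
        items,
        key=lambda it: combined_score(it, relevance_weight),
        reverse=True,
    )


items = [
    Item("niche but unpopular", 0.9, 0.1),
    Item("solid on both", 0.6, 0.6),
    Item("popular but off-topic", 0.1, 0.9),
]
print([it.title for it in rank(items)])
```

With these made-up numbers the item that is reasonably strong on both scores comes first, ahead of the one that is very relevant but unpopular.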

I think you could make a model that compares comments in the best comments feed with other comments. I have tried formulating the problems above as regression problems, where I try to predict the actual score, and it does not work well because of the uncertainty problem. Formulated as a classification problem for a score over a threshold, though, it is easy to make a well-calibrated model that tells you “this article has a 20% chance of frontpaging”, which is about the best anyone can do.
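"Well-calibrated" here means that among the items given roughly a 20% chance, about 20% actually clear the threshold. A minimal sketch of how that could be checked, assuming you already have predicted probabilities and 0/1 outcomes; the function name `calibration_table` and the equal-width binning are illustrative choices, not the method used in the comment:

```python
from typing import List, Sequence, Tuple


def calibration_table(
    labels: Sequence[int], probs: Sequence[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed positive rate, count)
    tuple per non-empty bin. For a well-calibrated model the first two
    numbers in each tuple are close to each other.
    """
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    table = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for _, p in members) / len(members)
        rate = sum(y for y, _ in members) / len(members)
        table.append((mean_p, rate, len(members)))
    return table


# Ten items all predicted at 20%, of which two actually made it.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
probs = [0.2] * 10
for mean_p, rate, count in calibration_table(labels, probs):
    print(f"predicted {mean_p:.2f}  observed {rate:.2f}  n={count}")
```

A regression on the raw score has no equivalent sanity check when the same article can get 0 votes on one submission and 200 on another, which is part of why the thresholded formulation is easier to evaluate.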

1 comment

wow, that's all sorts of interesting.

You could also look at the commenter’s karma at the time of posting a comment and a while after, and guess at which comment got them the points.

It is hard to tell because it might take a while for a post to get comments and in the meantime the person writes more comments.

It might be more practical to add up the score of a user’s submissions and subtract that from the total to get a comments karma score and then divide that by the number of comments to get an average which at least gives you a per-user rank which would be worth something.
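The arithmetic for that last idea, as a rough sketch. The function name and inputs are hypothetical, and it assumes a user’s total karma is approximately the sum of their submission points and comment points, which may not hold exactly on HN:

```python
from typing import Optional, Sequence


def average_comment_karma(
    total_karma: int, submission_scores: Sequence[int], n_comments: int
) -> Optional[float]:
    """Estimate average karma per comment for one user.

    comment karma = total karma - sum of submission scores
    average       = comment karma / number of comments
    Returns None for a user with no comments.
    """
    if n_comments == 0:
        return None
    comment_karma = total_karma - sum(submission_scores)
    return comment_karma / n_comments


# 1000 total karma, 400 of it from three submissions, 300 comments.
print(average_comment_karma(1000, [200, 50, 150], 300))  # → 2.0
```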