by PaulHoule 571 days ago
I have my own personal recommender, YOShInOn, which is an RSS reader that shows me about 5% of what it ingests. If you look me up via my profile I could show you a demo.

My answer to the diversity problem is this: out of maybe 2000-10,000 items I have the system make N=20 clusters with

https://scikit-learn.org/1.5/modules/generated/sklearn.clust...

and instead of picking out the 300 items with the highest score I pick the top 15 items in each cluster. Everything I post to HN was selected by YOShInOn once and by me twice and I think you can see the clusters at work if you note I post articles about programming, sports, environmental issues, advanced manufacturing, omics, energy technology, etc. If I pulled the top 300 it would all be arXiv papers about recommender systems with a few "circular economy" manufacturing topics.

If you found, say, 200 HN articles on a certain topic last week you might need fewer clusters, maybe N=5. There are other approaches to the diversity problem in the literature too but this one is easy.
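
For anyone who wants to try this, here is a minimal sketch of the per-cluster selection step in plain Python. It assumes you already have a cluster label and a relevance score for every item; the function name `top_k_per_cluster` and the toy data are made up for illustration, and the clustering itself is left out (the scikit-learn link above is truncated, so which algorithm it points to isn't shown).

```python
from collections import defaultdict


def top_k_per_cluster(labels, scores, k):
    """Return indices of the k highest-scoring items in each cluster.

    labels[i] is the cluster id of item i (assumed to come from whatever
    clustering step you run); scores[i] is the recommender's score.
    """
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    selected = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # Highest score first; equal scores keep their original order.
        members.sort(key=lambda i: scores[i], reverse=True)
        selected.extend(members[:k])
    return selected


# Toy data (hypothetical): cluster 0 holds the three best-scoring items.
labels = [0, 0, 0, 1, 1, 2]
scores = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]
picked = top_k_per_cluster(labels, scores, k=2)
print(picked)  # [0, 1, 4, 3, 5]
```

A global top 4 on the toy data would take all three items from cluster 0; the per-cluster pick caps each cluster at k. With N=20 clusters and k=15 it returns at most 300 items, the same budget as a global top 300 but spread across topics.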

I get amazing results with SBERT embeddings on HN titles and similar short texts. There is the trouble of ambiguous titles which nobody could classify but if a title is clear enough for you to get the gist of it, SBERT probably does well on it. If you are crawling the stories you are increasing your data 1000x but you are NOT going to get 1000x better results. Here is how I do on thumbs up/thumbs down classification with just titles and an obsolete algo:

https://ontology2.com/essays/ClassifyingHackerNewsArticles/

SBERT would add a few points of AUC, I could imagine into the low 0.8s, but the up/down classification is noisy.
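
In case the AUC number is unfamiliar: it is the probability that a randomly chosen thumbs-up item gets a higher score than a randomly chosen thumbs-down item. Here is a small sketch that computes it directly from that definition; the function name `auc` and the toy labels and scores are illustrative only and are not taken from the essay.

```python
def auc(labels, scores):
    """Pairwise AUC: labels are 1 (thumbs up) or 0 (thumbs down)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): one thumbs-up title is scored below a thumbs-down one.
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

A score of 0.5 is what random ranking gives and 1.0 is a perfect ranking. This pairwise loop is quadratic, which is fine for a quick check on a few thousand titles.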

You actually could make a decent prototype that uses just the titles and avoids web crawler problems, context-window-too-small problems, etc.

1 comment

Paul, this is quite amazing. Can I email you later? I have been "sniffing" about this idea for a little while and I am finishing up a smaller project, this is the next thing I want to tackle.