by jayunit 3189 days ago
Cool project! Thanks for publishing and sharing it.

It'd be interesting to know what topic terms it produces for each of my repos. It looks like it's taking all the repo descriptions, producing a topic model over that corpus with a single topic (`LdaModel(num_topics=1)`), and retrieving the top N terms for that topic. Those topic terms will be the most frequent words from the topic, so I think this will end up producing the most frequent words from the cleaned token set.
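To illustrate what I mean: if that reading is right, the output should be close to what a plain frequency count gives. A minimal sketch, assuming the descriptions are already cleaned and tokenizable on whitespace (the sample descriptions and the `top_terms` helper are made up for illustration, not taken from the project):

```python
from collections import Counter

# Hypothetical, already-cleaned repo descriptions standing in for the real corpus.
descriptions = [
    "python json parser",
    "python github api client",
    "json schema validator python",
]

def top_terms(docs, n=3):
    """Return the n most frequent whitespace-separated tokens across all docs."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    return [term for term, _ in counts.most_common(n)]

print(top_terms(descriptions, n=2))
```

Comparing this list against the single-topic LDA terms for the same repos would show whether the topic model is adding anything over raw counts.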

I'd be curious to see what happens if you could run LDA over the full dataset, produce multiple topics, and suggest repos based on those topics. This would be a pretty fun extension to the project!

If you're just running LDA over the repo description (and not looking into the content of any file, e.g. README), might http://ghtorrent.org/ be able to provide this?

Alternatively, maybe you want to include text from the README files -- could you use the Google Data snapshot of GitHub https://cloud.google.com/bigquery/public-data/github and do analysis like this: https://blog.exploratory.io/clustering-r-packages-based-on-g...

Or, it might be interesting to try producing a vector representation per repo by taking the description (and readme?), and doing something like: produce word vectors for each word, and sum the word vectors. https://spacy.io/ is a nice-to-use library that could help here.
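Roughly what I have in mind, as a sketch only: the 3-dimensional vectors below are toy values invented for the example, and `repo_vector` is a hypothetical helper. In practice the word vectors would come from a pretrained model rather than a hand-written dict.

```python
# Toy word vectors, made up for illustration; real ones would come from
# a pretrained model and have hundreds of dimensions.
WORD_VECTORS = {
    "python": [1.0, 0.0, 0.0],
    "json": [0.0, 1.0, 0.0],
    "parser": [0.0, 0.5, 0.5],
}

def repo_vector(text, vectors=WORD_VECTORS, dim=3):
    """Sum the vectors of the known words in text; unknown words are skipped."""
    total = [0.0] * dim
    for token in text.lower().split():
        vec = vectors.get(token)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total

print(repo_vector("Python JSON parser"))
```

Summing favours longer descriptions; averaging instead of summing is a common variation if that turns out to matter.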

Once you have a vector representation for each repo, a similarity measure such as cosine similarity could find related repos. Or (depending on the dataset size / performance), an approximate nearest-neighbor method like spill trees or LSH forest.
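For a small dataset the exact version is just a brute-force scan. A minimal sketch, where the repo names, the vectors, and the `most_similar` helper are all hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query, repo_vectors, k=2):
    """Return the k repos closest to `query` as (name, similarity) pairs."""
    scored = [
        (name, cosine_similarity(repo_vectors[query], vec))
        for name, vec in repo_vectors.items()
        if name != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up repo vectors for illustration only.
REPO_VECTORS = {
    "json-parser": [1.0, 1.5, 0.5],
    "json-validator": [0.0, 2.0, 0.5],
    "http-client": [2.0, 0.0, 0.1],
}

print(most_similar("json-parser", REPO_VECTORS, k=1))
```

This compares the query against every other repo, so each lookup is linear in the number of repos; that cost is what the approximate methods are meant to avoid on a large dataset.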

Looking forward to seeing where this goes next!

1 comment

Some really good suggestions. Could you please raise an issue on the repo so that we can keep track of them?