Hacker News
by opportune 3190 days ago
Typo when you don't have the right kind of repos starred: "yeild" should be "yield".

I've worked with NLP a bit before, but I haven't worked with LDA and have only read the Wikipedia article and the gensim documentation. One thing I don't understand is why you only generate a single topic for each user and then query its top n (5) terms. From what I understand of LDA, its usefulness is in partitioning text into k separate topics based on how often words are used in similar contexts. In my mind, this is more or less analogous (please tell me if this is wrong) to finding k centroids for a vector representation of the text after training a word2vec mapping (in an appropriately low dimension given the document size) on that text. However, if you are only finding a single topic, you are only using one centroid, so your search terms will be the n tokens that are closest to that centroid. I'm pretty sure that the tokens (from the text) closest to the centroid of a word2vec mapping trained on a text will mostly consist of high-frequency words and semi-stop words (by this I mean words that appear in varied contexts because of their role in the language, but that are not filtered out by the stop word check).

Also, if someone has many different topical interests, a single topic might over-represent whichever interest has the plurality of text dedicated to it. For example, if my starred repos are something like 30% Fortran, 30% JavaScript, and 40% Java, I believe your algorithm will produce queries that mostly contain Java terms. This seems to run counter to the goal of using LDA, which would be (to my understanding) to identify these latent topics and give relevant queries for each one, or for some combination of them.
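To make the 30/30/40 worry concrete, here is a rough sketch in plain Python. It is not the project's code and it does not run LDA. It assumes that a one-topic model behaves roughly like smoothed word counts pooled over all of a user's text, so its top terms are approximately the most frequent pooled terms. The token bags, the counts, and the function names are all made up, and the counts are deliberately skewed so the effect is visible.

```python
from collections import Counter

# Hypothetical token bags standing in for a user's starred repos (not real data).
# Sizes follow the 30% / 30% / 40% split from the example above.
starred = {
    "fortran": ["subroutine"] * 11 + ["module"] * 10 + ["array"] * 9,
    "javascript": ["promise"] * 11 + ["callback"] * 10 + ["node"] * 9,
    "java": ["class"] * 15 + ["interface"] * 13 + ["jvm"] * 12,
}


def pooled_top_terms(groups, n):
    """Top n terms when all text is pooled, as a stand-in for a single topic."""
    counts = Counter()
    for tokens in groups.values():
        counts.update(tokens)
    return [term for term, _ in counts.most_common(n)]


def per_group_top_terms(groups, n):
    """Top n terms computed separately for each group, as a stand-in for k topics."""
    return {
        name: [term for term, _ in Counter(tokens).most_common(n)]
        for name, tokens in groups.items()
    }


print(pooled_top_terms(starred, 3))
print(per_group_top_terms(starred, 1))
```

With these made-up counts the pooled query terms all come from the Java bag, while the per-group version returns one term for each interest. Real LDA output with k > 1 would not split this cleanly, but the direction of the effect is the point.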

I think a good way to address this would be to implement some way to change the default number of topics. One approach may be to use a trained word2vec model (perhaps trained on GitHub data itself) to determine the "spread" of the incoming tokens: you could construct cliques based on the pairwise distance between vectors and do something with that (let k be the number of cliques, or the number of cliques of size greater than m, etc.).
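A minimal sketch of the clique idea, under several assumptions: the 2-D vectors below are hand-made stand-ins for word2vec embeddings, cosine distance is the distance measure, `max_dist` and `min_size` are arbitrary knobs, and maximal cliques are enumerated with the basic Bron-Kerbosch recursion (fine for a toy graph, slow for a large vocabulary). All names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_graph(vectors, max_dist):
    """Connect two tokens when their vectors are within max_dist of each other."""
    names = list(vectors)
    adj = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine_distance(vectors[a], vectors[b]) <= max_dist:
                adj[a].add(b)
                adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


def choose_k(vectors, max_dist, min_size):
    """Number of maximal cliques with at least min_size members."""
    graph = build_graph(vectors, max_dist)
    return sum(1 for clique in maximal_cliques(graph) if len(clique) >= min_size)


# Hand-made 2-D vectors (not real embeddings): two tight groups of tokens.
toy_vectors = {
    "class": (1.0, 0.1), "interface": (0.9, 0.2), "jvm": (1.0, 0.0),
    "promise": (0.1, 1.0), "callback": (0.0, 1.0), "node": (0.2, 0.9),
}

print(choose_k(toy_vectors, max_dist=0.1, min_size=2))
```

On this toy input the two groups form two cliques, so k would be 2. With real embeddings the result depends heavily on `max_dist`, and overlapping cliques are likely, so some merging rule would probably be needed.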

A different approach might be to precompute the vector average of each GitHub repo. Then you could perform richer comparisons directly against documents (e.g. compare each clique's centroid to the repos) without directly querying GitHub for tons of repos.
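Roughly what I have in mind for the precomputed averages, again as a sketch with made-up data: each repo is reduced once to the mean of its token vectors, and a clique centroid is then ranked against that index by cosine similarity. The repo names, the vectors, and the function names are hypothetical, and a real index would need something faster than a linear scan.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(component) / count for component in zip(*vectors))


def cosine_similarity(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_repos(centroid, repo_index):
    """Repo names sorted from most to least similar to the centroid."""
    return sorted(
        repo_index,
        key=lambda name: cosine_similarity(centroid, repo_index[name]),
        reverse=True,
    )


# Hypothetical token vectors per repo (hand-made, not real embeddings).
repo_token_vectors = {
    "repo_a": [(1.0, 0.1), (0.9, 0.2)],
    "repo_b": [(0.1, 1.0), (0.0, 1.0)],
}

# Precompute one average vector per repo, once, ahead of any user query.
repo_index = {name: mean_vector(vecs) for name, vecs in repo_token_vectors.items()}

# Centroid of one clique of a user's tokens (also made up).
clique_centroid = mean_vector([(1.0, 0.0), (0.9, 0.2)])

print(rank_repos(clique_centroid, repo_index))
```

Here `repo_a` ranks first because its average vector points in nearly the same direction as the clique centroid.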

1 comments

Thanks for your suggestions. I will keep track of them and try to include them in the next run. Feel free to raise an issue if this bothers you a lot.

Also, I have rectified the typo.