by futureishere 3189 days ago
That is one really cool application of LDA!

I am so curious about your implementation. For instance, what sort of preprocessing did you have to carry out? I wrote a script some time back to analyze Paul Graham's essays (link: https://github.com/futureUnsure/pg-essay-lda), and had to remove dates and times because they appeared a lot and distorted the top topics. I'm wondering if you had to do something similar for text that describes code?
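To make the date/time filtering concrete, here is a rough sketch of the kind of token filter I mean. It is not the code from the linked repo; the function name `strip_dates_and_times`, the month list, and the time pattern are all assumptions made up for illustration.

```python
import re

# Hypothetical filter: month names, bare numbers (days, years) and clock times.
# Note that "may" is also an ordinary word; it is dropped here for simplicity.
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
TIME_RE = re.compile(r"^\d{1,2}:\d{2}(am|pm)?$")


def strip_dates_and_times(tokens):
    """Return the tokens that do not look like part of a date or time."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in MONTHS:
            continue  # month name
        if low.isdigit():
            continue  # day of month or year
        if TIME_RE.match(low):
            continue  # clock time such as 10:30 or 9:05pm
        kept.append(tok)
    return kept


tokens = "Posted March 2008 at 10:30 startups need users".split()
cleaned = strip_dates_and_times(tokens)
```

Running this before building the bag-of-words keeps those tokens from dominating the topics.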

Also, did you write an LDA library yourself or did you leverage an existing library?

I apologize in advance if my questions sound naive/stupid, am just a noob...

1 comment

Thanks, I am using the gensim package for LDA. In a nutshell:

1. Get descriptions of the repos the user is interested in
2. Cleanup/filtering/tokenization
3. Use LDA to generate topics
4. Use the topics to search for repositories GitHub can provide