by duhprey 5933 days ago
I'm not sure I'll have the time to do this, but I've had some good results running Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) on a similar problem. In my case, I have data from people playing a negotiation game and having a conversation with a human actor, plus scores from a human judge on a scale of 1 to 5. Using LDA on the transcriptions of the dialog, I can predict the human judge's scores with a correlation of .5. A previous study on essays graded by a teacher got .8 with LSA. That LSA study used a much larger training corpus, drawn from sources beyond the individuals being graded.

For slightly more detail, here's a sketch of the algorithm: treat each comment as a "document" input to LDA. Take the theta matrix, which gives the distribution of topics for each document. Then use the inverse of the dot product between two documents' theta vectors as the distance, and run k Nearest Neighbors to predict IDs. You should be able to tune the rank (the number of topics) and k using all the labelled data.
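A minimal sketch of the kNN step, assuming theta has already been computed by whatever LDA implementation you use. The theta rows, the labels, and the `knn_predict` helper are made up for illustration. It also assumes "inverse dot product" means a distance of 1 / dot, so the nearest neighbors are simply the ones with the largest dot product.

```python
from collections import Counter


def dot(u, v):
    """Dot product of two equal-length topic vectors."""
    return sum(a * b for a, b in zip(u, v))


def knn_predict(theta_train, labels, theta_query, k=3):
    """Majority vote among the k labelled documents closest to the query.

    Sorting by dot product, largest first, gives the same order as
    sorting by 1 / dot, smallest first.
    """
    ranked = sorted(
        range(len(theta_train)),
        key=lambda i: dot(theta_train[i], theta_query),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up topic mixtures over 3 topics, one row per labelled document.
theta_train = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
labels = ["a", "a", "b", "b", "c"]  # hypothetical IDs

prediction = knn_predict(theta_train, labels, [0.75, 0.15, 0.10], k=3)
print(prediction)
```

To tune k, one option is to hold out each labelled document in turn, predict it from the rest, and keep the k with the best accuracy. The same loop can wrap the LDA fit if you also want to tune the number of topics.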

When it comes time to infer, I suggest running the whole set through LDA instead of reusing the discovered alpha and beta. For some reason (which I'm not entirely sure of), my results seem much better that way.
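To make the two options concrete, here is a rough sketch using scikit-learn, which is an assumption on my part and not necessarily the toolkit behind the results above. The documents are made up. Note that scikit-learn keeps its priors fixed instead of learning alpha, so "reusing alpha and beta" maps only approximately onto calling `transform` with the already fitted topic-word matrix.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: labelled ones first, then the ones to predict.
train_docs = [
    "i can offer ten dollars for the apples",
    "ten dollars is too low for apples",
    "tell me about your family and your farm",
    "my family has worked this farm for years",
]
new_docs = [
    "would you take twelve dollars for the apples",
    "how long has your family lived here",
]

# Shared vocabulary so both options see the same word columns.
vectorizer = CountVectorizer()
X_all = vectorizer.fit_transform(train_docs + new_docs)
n_train = len(train_docs)

# Option A (the one suggested above): fit LDA on the whole set,
# then read the theta rows for the new documents straight off the fit.
lda_all = LatentDirichletAllocation(n_components=2, random_state=0)
theta_all = lda_all.fit_transform(X_all)
theta_train_a = theta_all[:n_train]
theta_new_a = theta_all[n_train:]

# Option B: fit on the labelled documents only, then infer theta
# for the new documents from the fitted model.
lda_train = LatentDirichletAllocation(n_components=2, random_state=0)
theta_train_b = lda_train.fit_transform(X_all[:n_train])
theta_new_b = lda_train.transform(X_all[n_train:])

print(np.round(theta_new_a, 2))
print(np.round(theta_new_b, 2))
```

Either way, the theta rows for the new documents are what goes into the nearest-neighbor lookup. Option A has to be rerun whenever new documents arrive, which is the price of fitting the topics on everything.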