by joeroot 4345 days ago
I see, apologies! I thought you had pulled in all episodes with a score > 7.

I wonder how easily arcs can be identified. I'll try running the transcripts through a topic model this evening.

2 comments

You say that so casually. As someone with no experience in NLP, I have to ask: do you just have topic model algos or a toolkit lying around? Going by your website, this isn't your field of expertise. Have you worked on this stuff in the past, or is it just a hobby?
Enough to get by! I worked on this throughout university, and now at my startup. I was being slightly blasé, however! The corpus is tiny (173 episodes: http://www.chakoteya.net/ds9/episodes.htm), so a topic model is unlikely to yield anything valuable. There are probably around 10-15 arcs, and simple clustering could work better -- but this is purely hypothetical. In this case, it's simply curiosity.

If you're interested in tools, Mallet (http://mallet.cs.umass.edu/) is a fairly good place to start, and the original LDA paper by Blei, Ng & Jordan (http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ...) is a great academic starting point.
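
To make the "simple clustering" idea a bit more concrete, here is a rough stdlib-only sketch. It is not LDA and not what Mallet does: it compares raw word counts with cosine similarity and greedily groups each transcript with the first earlier transcript it resembles. The function names and the 0.5 threshold are assumptions picked for illustration, not tuned values.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_clusters(transcripts, threshold=0.5):
    """Group transcript indices; the first index in each group acts as its seed.

    Assumption: 0.5 is an arbitrary cut-off and would need tuning on real data.
    """
    vectors = [bag_of_words(t) for t in transcripts]
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            # Compare against the cluster's seed transcript only.
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters
```

On real transcripts you would almost certainly want stop-word removal and TF-IDF weighting first, since common words will otherwise dominate the similarity scores.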

Cool, thanks!
Most of the arc episodes are labeled as part of an arc, like "When it rains... (5)"
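
If the labels are consistent, they can be picked out with a regular expression. A minimal sketch, assuming an arc episode title ends with a part number in parentheses as in the example above; `parse_arc_label` is a made-up helper name, and titles without the suffix are treated as standalone:

```python
import re

# Assumption: arc titles end with "(N)", e.g. "When it rains... (5)".
ARC_SUFFIX = re.compile(r"^(?P<title>.*?)\s*\((?P<part>\d+)\)\s*$")


def parse_arc_label(title):
    """Return (clean_title, part_number); part_number is None if unlabeled."""
    match = ARC_SUFFIX.match(title)
    if match is None:
        return title.strip(), None
    return match.group("title"), int(match.group("part"))
```

This only recovers the part number, not which arc an episode belongs to, so consecutive numbering would still have to be used to stitch parts together.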