	by joeroot 4392 days ago
	I see, apologies! I thought you had pulled in all episodes with a score > 7. I wonder how easily arcs can be identified. I'll try running the transcripts through a topic model this evening.

2 comments

jkaunisv1 4392 days ago

You say that so casually. As someone with no experience in NLP, I have to ask: do you just have topic model algos or a toolkit lying around? Going by your website, this isn't your field of expertise. Have you worked on this stuff in the past, or is it just a hobby?

joeroot 4392 days ago

Enough to get by! I worked on this throughout university, and now at my startup. I was slightly blasé, however! The corpus is tiny (173 episodes: http://www.chakoteya.net/ds9/episodes.htm), so a topic model is unlikely to yield anything valuable. There are probably around 10-15 arcs, and simple clustering could work better -- but this is purely hypothetical. In this case, it's simply curiosity.

If you're interested in tools, Mallet (http://mallet.cs.umass.edu/) is a fairly good place to start, and the original LDA paper by Blei, Ng & Jordan (http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ...) is a great academic starting point.
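For a sense of what "simple clustering" on a corpus this small might look like, here is a minimal sketch using only the Python standard library. It is not Mallet and not LDA: it compares bag-of-words vectors with cosine similarity and greedily groups episodes around the first sufficiently similar seed. The function names, the input dict of episode name to transcript text, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_clusters(transcripts, threshold=0.5):
    """Put each transcript in the first cluster whose seed is similar enough.

    transcripts: dict of episode name -> transcript text (hypothetical input).
    threshold: assumed cut-off; it would need tuning on real transcripts.
    Returns a list of lists of episode names.
    """
    clusters = []  # each entry is (seed_vector, [episode names])
    for name, text in transcripts.items():
        vec = bag_of_words(text)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((vec, [name]))
    return [members for _, members in clusters]
```

Raw word counts will be dominated by common words on real transcripts, so stop-word removal or TF-IDF weighting would likely be needed before the groups resemble story arcs.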

jkaunisv1 4384 days ago

Cool, thanks!

tessa_t 4392 days ago

Most of the arc episodes are labeled as part of an arc, like "When it rains... (5)"
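If the labels really do follow that shape, pulling them out could be as simple as a regular expression. This is a rough sketch that assumes the marker is always a number in parentheses at the end of the title, as in the example above; the pattern and the function name are made up for illustration and have not been checked against the actual episode list.

```python
import re

# Assumed label shape: title text followed by a parenthesised part number,
# e.g. "When it rains... (5)".
ARC_LABEL = re.compile(r"^(?P<title>.*?)\s*\((?P<part>\d+)\)\s*$")


def parse_arc_label(raw):
    """Return (title, part); part is None when no trailing "(n)" is present."""
    match = ARC_LABEL.match(raw)
    if match is None:
        return raw.strip(), None
    return match.group("title"), int(match.group("part"))
```

Episodes that come back with a part number can then be grouped into arcs by sorting on that number, with no topic model involved.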
