| Boxfish scientist & dev here. Closed captioning does have a lot of noise in it, but we've done a lot of work to tidy that up. We also have the benefit of capturing so much data that the noise doesn't matter as much. In real time, our systems extract and generate lots of information from the closed captions. Our NLP system identifies entities (e.g. White House, Chris Brown, Amanda Berry) and does frequency counts, and we maintain some very large graphs of entity co-occurrences, which feed our statistical learning; e.g. Rihanna is commonly associated with Chris Brown. We do a bunch more analysis on those graphs, including Latent Semantic Indexing (LSI), which helps quantify the relationships between entities. Related to that, we generate TF-IDF scores for all identified entities, which give a sense of how "important" an entity is. By combining our large-scale entity graphs (both frequency and LSI) with streams of closed captions, we also do real-time topic extraction at multiple time scales, e.g. what is the topic of conversation on CNN for the last minute, for the last 5 minutes, for the whole program, etc. Feel free to ask me more. |
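
To make the ideas above a bit more concrete, here is a rough sketch of co-occurrence counting, TF-IDF scoring, and trailing-window topic picking. It is only an illustration under assumptions, not Boxfish's actual pipeline: the function names (`cooccurrence_counts`, `tfidf`, `top_entities`), the toy data, and the choice of the plain `tf * log(N / df)` weighting are all made up for the example, and entity recognition is assumed to have already happened upstream.

```python
from collections import Counter
from itertools import combinations
import math


def cooccurrence_counts(segments):
    """Count how often each unordered entity pair shares a caption segment."""
    pairs = Counter()
    for entities in segments:
        # Sort so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


def tfidf(segments):
    """Return one {entity: score} dict per segment, using tf * log(N / df)."""
    n = len(segments)
    df = Counter()
    for entities in segments:
        df.update(set(entities))  # document frequency: segments containing the entity
    scores = []
    for entities in segments:
        tf = Counter(entities)
        total = len(entities)
        scores.append({e: (c / total) * math.log(n / df[e]) for e, c in tf.items()})
    return scores


def top_entities(timed_mentions, now, window_seconds, k=3):
    """Most frequent entities among (timestamp, entity) mentions in the trailing window."""
    counts = Counter(
        e for t, e in timed_mentions if now - window_seconds <= t <= now
    )
    return [e for e, _ in counts.most_common(k)]


# Hypothetical, already-extracted entities per caption segment.
segments = [
    ["rihanna", "chris brown"],
    ["rihanna", "chris brown", "white house"],
    ["white house", "amanda berry"],
]
pairs = cooccurrence_counts(segments)
scores = tfidf(segments)

# Hypothetical (seconds, entity) mentions from one channel's caption stream.
mentions = [(0, "white house"), (250, "rihanna"), (280, "rihanna"), (290, "chris brown")]
last_minute = top_entities(mentions, now=300, window_seconds=60, k=1)
last_five_minutes = top_entities(mentions, now=300, window_seconds=300, k=1)
```

In this toy data the Rihanna / Chris Brown pair co-occurs in two segments, and the rarer entity in the third segment gets the higher TF-IDF score. Calling `top_entities` with different `window_seconds` values is one simple way to get the "last minute vs. last 5 minutes" view; a real system would presumably use the graph and LSI signals rather than raw counts.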