by mikebennett 4788 days ago
Boxfish scientist & dev here.

Closed captioning does have a lot of noise in it, but we've done a lot of work to tidy that up. We also have the benefit of capturing so much data that the noise doesn't matter as much.

In real time, our systems extract and generate lots of information from the closed captions. Our NLP system identifies entities (e.g. whitehouse, chris brown, amanda berry) and does frequency counts, and we have some very large graphs of entity co-occurrences (which lead into our statistical learning), e.g. Rihanna is commonly associated with Chris Brown.
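
A minimal sketch of the frequency and co-occurrence counting described above, assuming entities have already been extracted per caption segment. The function name, the segment layout, and the sample data are hypothetical, not Boxfish's actual code.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(segments):
    """Count entity frequencies and pairwise co-occurrences.

    segments: iterable of lists of entity strings, one list per caption
    segment (an assumed input format).
    """
    freq = Counter()   # number of segments each entity appears in
    pairs = Counter()  # number of segments each entity pair shares
    for entities in segments:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted(set(entities))
        freq.update(unique)
        pairs.update(combinations(unique, 2))
    return freq, pairs


# Made-up sample segments, for illustration only.
segments = [
    ["rihanna", "chris brown"],
    ["rihanna", "chris brown", "whitehouse"],
    ["whitehouse", "amanda berry"],
]
freq, pairs = cooccurrence(segments)
```

The `pairs` counter is effectively a weighted edge list for the co-occurrence graph: each key is an edge and each value is its weight.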

We do a bunch more analysis on our graphs, including Latent Semantic Indexing (LSI), which helps us quantify the relationships between entities. Related to that, we generate TF-IDF scores for all identified entities, which give a sense of how "important" an entity is.
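
A rough sketch of the TF-IDF part only, assuming each "document" is the list of entities found in one program or segment. This uses the textbook formulation (term frequency times log of inverse document frequency); the exact weighting Boxfish uses is not stated, and the LSI step, which would typically involve a singular value decomposition, is not shown.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {entity: score} dict per document.

    docs: list of lists of entity strings (an assumed input format).
    """
    n = len(docs)
    df = Counter()  # number of documents containing each entity
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append(
            {e: (c / total) * math.log(n / df[e]) for e, c in tf.items()}
        )
    return scores


# Made-up sample data: "cnn" appears everywhere, so it scores zero.
docs = [
    ["cnn", "rihanna", "rihanna"],
    ["cnn", "whitehouse"],
    ["cnn", "amanda berry"],
]
scores = tfidf(docs)
```

An entity that shows up in every document gets a score of zero, which is the sense in which TF-IDF separates "important" entities from ubiquitous ones.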

By combining our large-scale entity graphs (both Frequency and LSI) with streams of closed captions, we also do real-time topic extraction at multiple time scales, e.g. what is the topic of conversation on CNN for the last minute, for the last 5 minutes, for the whole program, etc.
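
To illustrate the multiple-time-scale idea, here is a simplified sketch that keeps one sliding window per time scale and reports the most frequent entities in each. It uses plain frequency as a stand-in for the graph-based scoring described above, and the class name, window sizes, and sample stream are all assumptions.

```python
from collections import Counter, deque


class WindowedTopics:
    """Track entity counts over the last `window_seconds` of a stream."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, entity), oldest first
        self.counts = Counter()

    def add(self, timestamp, entity):
        self.events.append((timestamp, entity))
        self.counts[entity] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, k=3):
        return [entity for entity, _ in self.counts.most_common(k)]


# One tracker per time scale: last minute and last 5 minutes.
windows = {60: WindowedTopics(60), 300: WindowedTopics(300)}

# Made-up stream of (seconds, entity) pairs.
stream = [
    (0, "whitehouse"), (10, "whitehouse"), (20, "whitehouse"),
    (250, "rihanna"), (290, "rihanna"),
]
for ts, entity in stream:
    for tracker in windows.values():
        tracker.add(ts, entity)

last_minute = windows[60].top(1)
last_five = windows[300].top(1)
```

With this sample stream, the one-minute window sees only the recent mentions while the five-minute window is still dominated by the earlier ones, so the two scales can report different topics at the same moment.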

Feel free to ask me more.

4 comments

What about STT and/or machine translation for foreign channels? It is really frustrating when foreign stations don't even have CC if you're trying to learn the language and need more spoken input.
While we could use STT (some of our team have backgrounds in it) we sought to use the cleanest existing signal, i.e. closed captions.

Part of the motivation for taking a statistical NLP approach is that it gives us more flexibility for processing foreign stations / languages (we don't yet do that).

I wonder whether you could time- and geo-shift closed captions, i.e. show closed captions in two languages at once on the same TV program? That could make an interesting language learning tool and an interesting training set for machine translation.

>>Closed captioning does have a lot of noise in it, but we've done a lot of work to tidy that up.

What's your average accuracy?

Associating Rihanna to Chris Brown is pretty trivial. The amount of ink that correlates them is enormous; that doesn't seem like a very interesting example. I guess it depends on what you are going for.
That is a trivial example. It easily pops out of the data with a pure frequency co-occurrence measurement, but as I said, frequency is only a starting point. LSI and co. generate far more interesting graphs, which can be mined in many ways.
Is this system-on-chip?
Unfortunately not. The information extraction aspect involves heavy-duty math, where the more computational resources you can throw at the problem, the better. We have a bunch of servers where all the CPU cores are running at max 24 hours a day. At the moment, we're not using GPUs for our matrix math (the matrices tend to be very large). For our matrix math I am a fan of the fast ojAlgo math library: http://ojalgo.org
Pure Java for matrix math? Seems like that would be something to avoid. Have you benchmarked it?
I used the benchmarks from http://code.google.com/p/java-matrix-benchmark

The decision came down to three things: the speed is reasonable, the library has some other math functions I need, and the time taken to build.