by dangrsmind
7047 days ago
I'd say I'm a little skeptical. My first question is how fast they can parse video; my second is how much it costs to do it. It seems you would need recognition that runs much faster than real time for a realistic web video search capability (see, for example, http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=599600), and you would certainly need a lot of hardware to do this at scale for millions of video clips. See also: http://www.newmediamusings.com/blog/2005/09/blinkx_a_citize.html
|
That paper is 10 years old. As I'm sure you can imagine, there have been improvements in the field since then. To be completely honest, I don't stay on top of search applied to speech, but the keyword you want is "Spoken Document Retrieval" (SDR). Ciprian Chelba and TJ Hazen do cool stuff in this area; they are giving a tutorial on SDR at ICASSP this year.
An aside: both of these approaches use the fact that when you process speech, you essentially form a graph of words (or phonemes). Paths through the graph represent possible transcriptions. So, since the graph is a denser, richer thing to search than the transcript, and we've got graph algorithms sitting around, there are neat tricks you can do to build a search engine index for speech...
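To make the graph idea concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the toy lattice, the edge probabilities, and the names `paths` and `build_index` are my assumptions, not anything from the SDR literature. It enumerates every path through a tiny word graph and indexes each word by the total probability of the paths that contain it, so a word that never makes it into the single best transcript can still be found.

```python
from collections import defaultdict

# Hypothetical toy lattice: each edge is (from_node, to_node, word, probability).
# The probabilities on the edges leaving a node sum to 1.
EDGES = [
    (0, 1, "nice", 0.6),
    (0, 1, "ice", 0.4),
    (1, 2, "speech", 0.7),
    (1, 2, "beach", 0.3),
]
START, END = 0, 2


def paths(edges, node, end):
    """Yield (words, probability) for every path from node to end."""
    if node == end:
        yield [], 1.0
        return
    for src, dst, word, p in edges:
        if src == node:
            for words, q in paths(edges, dst, end):
                yield [word] + words, p * q


def build_index(edges, start, end):
    """Map each word to the total probability of the paths that contain it."""
    index = defaultdict(float)
    for words, p in paths(edges, start, end):
        for w in set(words):
            index[w] += p
    return dict(index)


index = build_index(EDGES, START, END)
best_words, best_prob = max(paths(EDGES, START, END), key=lambda wp: wp[1])
```

Here the best path is "nice speech", so a transcript-only index would miss "beach", while the graph index still lists it with some probability. Enumerating every path only works for a toy; a real lattice would need a forward-backward style computation instead.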
I've recently been reading some interesting work that uses locality-sensitive hashing to search audio. The Google speech people are presenting a lot of it at ICASSP this year. See this post for more, and chase the links in their papers for even more: <http://googleresearch.blogspot.com/2007/02/hear-here-sample-of-audio-processing.html>
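For anyone who hasn't seen locality-sensitive hashing, here is a minimal sketch of one generic variant (random hyperplanes). This is not necessarily the scheme used in the work linked above, and the class name `LSHIndex`, the four-number "feature vectors", and the clip names are all hypothetical. The idea is that each vector gets a short bit signature from the signs of its dot products with random planes, and vectors pointing in similar directions tend to land in the same bucket.

```python
import random
from collections import defaultdict


def make_planes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (as normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per plane: which side of the plane the vector falls on."""
    return tuple(sum(a * b for a, b in zip(vec, plane)) >= 0 for plane in planes)


class LSHIndex:
    """Hypothetical single-table index that buckets items by signature."""

    def __init__(self, n_bits, dim, seed=0):
        self.planes = make_planes(n_bits, dim, seed)
        self.buckets = defaultdict(list)

    def add(self, key, vec):
        self.buckets[signature(vec, self.planes)].append(key)

    def query(self, vec):
        """Return the keys stored in the bucket the query vector hashes to."""
        return list(self.buckets.get(signature(vec, self.planes), []))


# Made-up feature vectors standing in for audio fingerprints.
lsh = LSHIndex(n_bits=8, dim=4)
lsh.add("clip_a", [0.9, 0.1, 0.3, 0.5])
lsh.add("clip_b", [-0.2, 0.8, -0.7, 0.1])

# A scaled copy of clip_a's vector has the same direction, so the same signature.
hits = lsh.query([1.8, 0.2, 0.6, 1.0])
```

A lookup only touches one bucket instead of comparing against every clip, which is the appeal at scale. With a single table, near (but not identical in direction) vectors can still fall into different buckets, so practical setups usually use several tables.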