by TeMPOraL
1884 days ago
You can easily process the audio on the fly and reduce it to a probabilistic estimate of whether a tag from a predefined topic set was present in the conversation. It doesn't need to be 100% accurate. You don't need to store the audio; just stream it through the recognizer. The output of such a recognizer will be something on the order of 8-32 bytes (an int for the tag, a float for the probability, an int64 for the timestamp), possibly less if one's clever, and it only needs to be stored until the next opportunity to send it out.

Also: people seem to be looking at modern speech recognizers on their phones and wrongly concluding that speech recognition in general is very compute-intensive. It isn't, if you're willing to make some sacrifices on accuracy and generality, and to do it locally instead of sending voice data off to a cloud somewhere. A proper benchmark here isn't Siri or Google Assistant; it's the Microsoft Speech API, as shipped with Windows 12+ years ago.
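To make the storage arithmetic concrete, here is a minimal sketch of the record described above. The exact layout is an assumption for illustration: a 32-bit int tag, a 32-bit float probability and a 64-bit Unix timestamp, packed little-endian with no padding, which comes to 4 + 4 + 8 = 16 bytes. `DetectionBuffer` and its methods are hypothetical names, and the recognizer that would produce the detections is not shown.

```python
import struct
import time

# Assumed record layout (illustrative, not from any real product):
# int32 tag id, float32 probability, int64 Unix timestamp in seconds.
# "<" means little-endian with standard sizes and no alignment padding.
RECORD_FORMAT = "<ifq"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 4 + 4 + 8 = 16 bytes


def pack_detection(tag: int, probability: float, timestamp: int) -> bytes:
    """Encode one detection as a fixed-size 16-byte record."""
    return struct.pack(RECORD_FORMAT, tag, probability, timestamp)


def unpack_detection(record: bytes) -> tuple:
    """Decode a 16-byte record back into (tag, probability, timestamp)."""
    return struct.unpack(RECORD_FORMAT, record)


class DetectionBuffer:
    """Hypothetical holder for packed records until the next upload opportunity.

    The audio itself is never stored; only these small records are.
    """

    def __init__(self) -> None:
        self._pending = bytearray()

    def add(self, tag: int, probability: float, timestamp: int = None) -> None:
        # Default to the current time if the caller doesn't supply one.
        if timestamp is None:
            timestamp = int(time.time())
        self._pending += pack_detection(tag, probability, timestamp)

    def flush(self) -> bytes:
        """Return everything queued so far and drop it locally."""
        data = bytes(self._pending)
        self._pending.clear()
        return data


buf = DetectionBuffer()
buf.add(17, 0.5, 1_600_000_000)
payload = buf.flush()
print(len(payload))  # 16
```

Being "clever" could mean, for example, a 16-bit tag, a probability quantized to one byte and a 32-bit timestamp (format `"<HBI"`), which packs to 7 bytes per detection.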