by Jingyi0321 353 days ago
In general, WebRTC VAD relies on pitch information. Pitch is present only in voiced speech, not in unvoiced speech, so WebRTC VAD may fail to detect the start of a word that begins with an unvoiced sound. Losing that unvoiced onset can then, for example, increase WER in a downstream ASR system. Conversely, noise whose spectrum resembles voiced speech, e.g. music, may yield a non-zero pitch in WebRTC VAD's pitch detector and so be mistaken for speech.
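To make the two failure modes concrete, here is a toy sketch of a pitch-only voicing check. It is not WebRTC's code. It uses a plain normalized autocorrelation as a stand-in pitch detector, and the names (`pitch_peak`, `has_pitch`), the 8 kHz sample rate, the 0.5 threshold, and the synthetic signals (a sine for a vowel, white noise for a fricative such as /s/, a sine for a musical note) are all assumptions made for illustration.

```python
import math
import random

SAMPLE_RATE = 8000   # Hz, assumed for this toy example
FRAME_LEN = 240      # 30 ms at 8 kHz


def pitch_peak(frame, sample_rate=SAMPLE_RATE, fmin=60.0, fmax=400.0):
    """Return (peak, f0_hz): the largest normalized autocorrelation value
    found among lags that correspond to pitches between fmin and fmax."""
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]           # remove DC offset
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0, 0.0                     # silent frame: no pitch
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), n - 1)
    best_peak, best_lag = 0.0, 0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        r = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if r > best_peak:
            best_peak, best_lag = r, lag
    f0 = sample_rate / best_lag if best_lag else 0.0
    return best_peak, f0


def has_pitch(frame, threshold=0.5):
    """Pitch-only decision: 'speech' if the frame looks periodic enough."""
    peak, _ = pitch_peak(frame)
    return peak >= threshold


def tone(freq_hz):
    """One frame of a pure sine at freq_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(FRAME_LEN)]


rng = random.Random(0)                      # fixed seed for repeatability
voiced = tone(150.0)                        # stand-in for a vowel
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]  # stand-in for /s/
music = tone(220.0)                         # stand-in for a sustained note

print("voiced  ->", has_pitch(voiced))
print("unvoiced->", has_pitch(unvoiced))
print("music   ->", has_pitch(music))
```

In this sketch the noise-like frame has no strong periodicity and is rejected, which is the "lost unvoiced start" case, while the musical tone is as periodic as the vowel and is accepted, which is the false-alarm case. A detector that also looks at spectral shape has more information to separate these.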

Our model takes fbank features and pitch information together as input and can analyse the input patterns in more depth, so it performs better than WebRTC VAD.