by mrob
2917 days ago
If we're abandoning accurate reproduction of sound and just making up anything that sounds plausible, there's already a far more efficient codec: plain text. Assuming 150 wpm and an average of 2 bytes per word (with lossless compression), we get about 5 bytes per second, or 40 bps, which makes 2400 bps look much less impressive. Add some markup for prosody and it will still be much lower. This codec also has the great advantage that you can turn off the speech synthesis and just read it, which is much more convenient than listening to a linear sound file.
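For anyone who wants to check the arithmetic, here is a quick sketch. The function name is made up for illustration, and the 150 wpm and 2 bytes per word figures are the assumptions stated above, not measurements.

```python
def text_bitrate_bps(words_per_minute: float, bytes_per_word: float) -> float:
    """Bits per second needed to send compressed text at a given speaking rate."""
    bytes_per_second = words_per_minute * bytes_per_word / 60
    return bytes_per_second * 8  # 8 bits per byte


# Assumed figures from the comment: 150 wpm, 2 bytes per word after compression.
rate = text_bitrate_bps(150, 2)
print(rate)          # 40.0 bits per second (5 bytes per second)
print(2400 / rate)   # 60.0, i.e. a 2400 bps stream uses 60 times the bandwidth
```

Prosody markup would raise the bytes-per-word figure, so it can be plugged in as a larger second argument to see how much headroom remains.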
|
If you have such a codec, it would be worth testing the word error rate on a long sample of audio. For example, take a few hours of call centre recordings, pass them through each of {your codec, Codec 2}, and then have a human transcribe each of:
- the original recording
- the audio output from your proposed codec (which presumably does STT followed by TTS)
- the audio output from Codec 2 at 2400 bps
Based on the current state of open source single-language STT models, I would expect the Codec 2 transcript to be much closer to the original. And if the input audio contains two or more languages, I cannot imagine the output of your codec being useful at all.
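Scoring the transcripts could look roughly like the sketch below. It assumes the usual definition of word error rate (word-level edit distance divided by the number of reference words), treats the human transcript of the original recording as the reference, and uses a made-up function name and made-up example sentences. Real scoring would also need text normalisation (punctuation, numbers, spelling variants), which is skipped here apart from lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, for illustration only.
original = "please hold while i transfer your call"
via_codec = "please hold while i transfer the call"
print(word_error_rate(original, via_codec))  # one substitution out of seven words
```

Running this once per codec against the same reference transcript gives two numbers that can be compared directly.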