Hacker News new | ask | show | jobs
by mrob 2917 days ago
If we're abandoning accurate reproduction of sound and just making up anything that sounds plausible, there's already a far more efficient codec: plain text.

Assuming 150wpm and an average 2 bytes per word (with lossless compression), we get about 5bps, which makes 2400bps look much less impressive. Add some markup for prosody and it will still be much lower.

This codec also has the great advantage that you can turn off the speech synthesis and just read it, which is much more convenient than listening to a linear sound file.

3 comments

That codec sounds great, if it exists.

If you have such a codec, it would be worth testing the word error rate on a long sample of audio. e.g. take a few hours of call centre recordings, pass them through each of {your codec, codec2}, and then have a human transcribe each of:

- the original recording

- the audio output from your proposed codec (which presumably does STT followed by TTS)

- the audio output from CODEC2 at 2048

Based on the current state of open source single-language STT models, I would imagine that CODEC2 would be much closer to the original. And if the input audio contains two or more languages, I cannot imagine the output of your codec will be useful at all.

Speech to text is certainly getting better but it makes mistakes. If the transcribed text was sent over the link and then a text to speech spoke at the other end you'd lose one of the great things about codec2 - the voice that comes out is recognisable as it sounds a bit like the person.

A few of us have a contact on Sunday mornings here in Eastern Australia and it's amazing how the ear gets used to the sound and it quickly becomes quite listenable and easy to understand.

Could you elaborate on "a contact"?

Are you using Codec2 over radio?

Yeah, the main use case for codec2 right now is over ham radio. David Rowe, along with a few others, also developed a couple of modems and a GUI program[1]. On Sunday mornings, around 10AM, they do a broadcast of something from the WIA and answer callbacks.

[1] - https://freedv.org/

What you might be able to do is your the text codec as the first pass, then augment the audio with Codec2 or so to capture the extra information (inflections, accent, etc...), for something in between 2 and 700bps.