| Yes, I heard the same artifacts! In the normal codec2 decoding it sounds like "seventy", but muffled and crunchy. In the wavenet decoding, the voice is clearly higher quality and crisp, but the word sounds more like "suthenty". And not because the audio quality makes it ambiguous: it sounds like it's very deliberately pronouncing "suthenty". It's as if, in trying to enhance and crisp up the sound, it corrected in the wrong direction. The compressed data that would otherwise decode to a muffled and indistinct "seventy" was interpreted by wavenet but "misheard", in a sense. When wavenet reconstructs the speech, it confidently outputs a much clearer and crisper voice, except it locks onto the wrong speech sounds. With the standard "muffled/crunchy" decoding, a listener can sort of "hear" this uncertainty. The speech sound is "clearly" indistinct, and we're prompted to do our own correction (in our heads), while knowing it might be wrong. When the machine learning net does this correction for us, we lose the information about how uncertain its guess is. This is exactly the sort of artifact I'd expect with this kind of system. As soon as I heard the ridiculously good and crisp audio quality of the wavenet decoder, I knew that fidelity can't all be in the encoded bits; that's impossible. It's a great and impressive accomplishment, but it has to "make up" some of those details, in a sense very similar to image super-resolution algorithms. I'm just thinking we should perhaps be careful not to get into a situation like the children's "telephone" game, if for some reason the speech gets decoded and re-encoded more than once. That is of course bad practice, but even if it happens by accident, the wavenet will decode into confident and crisp audio, so it may be hard to notice if you don't expect it. 
If audio is encoded and decoded a few times, it's possible that the wavenet will in fact amplify misheard speech sounds into radically different speech sounds, syllables or even words, changing the meaning. Kind of like the "deep dreaming" networks. That sounds like a particularly bad idea for encoding audiobooks, because small flourishes in wording really can matter. Edit: I just realised that repeated decoding and re-encoding can in fact happen quite easily if this codec is ever implemented and used in real-world phone networks. Many networks use different codecs, and re-encoding simply has to be done if a call is to pass through a particular network. But the whole thing is ridiculously cool regardless :) And I wonder if they can improve on this problem. |
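
To make the "telephone game" worry a bit more concrete, here is a toy sketch in Python. It is not how codec2 or wavenet work: a speech sound is reduced to a single number, the `PROTOTYPES` table is a made-up stand-in for the clean phonemes a network might have learned, and the two step sizes are invented stand-ins for two networks using different codecs. It only shows the mechanism: a decoder that always snaps to the nearest clean sound hides the ambiguity, and a second hop can push the result further away.

```python
# Toy model only: a "speech sound" is one number, and the prototypes
# below are hypothetical stand-ins for clean phonemes a network has learned.
PROTOTYPES = {"v": 0.0, "th": 1.0, "s": 2.0}


def encode(x, step):
    """Low-bitrate encoder: keep only the index of the nearest grid point."""
    return round(x / step)


def decode_plain(index, step):
    """Conventional decoder: return the grid point, blur and all."""
    return index * step


def decode_confident(index, step):
    """Generative-style decoder: snap to the nearest clean prototype."""
    rough = decode_plain(index, step)
    return min(PROTOTYPES.values(), key=lambda p: abs(p - rough))


def nearest_label(x):
    """Name of the prototype closest to x."""
    return min(PROTOTYPES, key=lambda name: abs(PROTOTYPES[name] - x))


def tandem(x, steps, decoder):
    """Pass x through several codecs in a row, recording each output."""
    outputs = []
    for step in steps:
        x = decoder(encode(x, step), step)
        outputs.append(x)
    return outputs


original = 0.45      # a slightly unclear "v"
steps = [0.6, 1.6]   # two networks with different (made-up) codecs

plain = tandem(original, steps, decode_plain)
confident = tandem(original, steps, decode_confident)

print("plain    :", plain)
print("confident:", confident, [nearest_label(v) for v in confident])
```

After the first hop the plain decoder outputs a value that sits visibly between "v" and "th", so a listener would still hear that it is unclear. The snapping decoder outputs a perfectly clean "th" instead, and after the second hop a perfectly clean "s". The plain path is lossy too; the difference in this toy is only that its output still looks uncertain.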