Hacker News new | ask | show | jobs
by rhdunn 772 days ago
Isn't the issue more that traditional ASR systems use US (General American) phonetic transcriptions so then struggle with accents that have different splits and mergers?

My understanding on whisper is that it is using a model trained on different accents, specifically from LibriVox. The quality would depend on the specific model selected.

The MFCC or other acoustic analysis is to detect the specific phonemes of speech. This is well understood (e.g. the first 3 formants corresponding to the vowels and their relative positions between speakers), and the inverse is used for a lot of the modern TTS engines where the MFCC is predicted and the waveform reconstructed from that (see e.g. https://pytorch.org/audio/stable/transforms.html).

Some words can change depending on adjacency to other words, or other speech phenomena (like H dropping) can alter the pronunciation of words. Then you have various homophones in different accents. All of these make it hard to go from the audio/phonetic representation to transcriptions.

This is in part why a relatively recent approach is to train the models on the actual spoken text and not the phonetics, so it can learn to disambiguate these issues. Note that this is not perfect, as TTS models like coqui-ai will often mispronounce words in different contexts as the result of a lack of training data or similar issues.

I'm wondering if it makes sense to train the models with the audio, phonetic transcriptions, and the text and score it on both phonetic and text accuracy. The idea being that it can learn what the different phonemes sound like and how they vary between speakers to try and stabilise the transcriptions and TTS output. The model would then be able to refer to both the audio and the phonemes when making the transcriptions, or for TTS to predict the phonemes then the phonemes as an additional input with the text to generate the audio -- i.e. it can use the text to infer things like prosody.