| Audiobooks are definitely usable for ASR training. Indeed, the largest open ASR training dataset before Common Voice was LibriSpeech (http://www.openslr.org/12/). Also note that the first release of Mozilla's DeepSpeech models was trained and tested with LibriSpeech: https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error... But as others have mentioned, there are several problems with audiobooks as an ASR training dataset. First, the language used in literature is often very different from how people actually speak, especially if it comes from very old texts (and many public domain books are indeed quite old). Second, there is the sound profile, which includes background noise, microphone quality, the speaker's distance to the device, etc. For recorded audiobooks, the speaker is often using a somewhat sophisticated setup to make the audio as clean as possible. That kind of setup is unusual when people speak to their devices. Third, the tone and cadence of read speech differ from those of spontaneous speech (the Common Voice dataset also has this problem, but they are coming up with ideas on how to prompt for spontaneous speech too). The goal of Common Voice was never to replace LibriSpeech or other open datasets (like TED talks) as training sets, but rather to complement them. You mention transfer learning. That is indeed possible. But it's also possible to simply put several datasets together and train on all of them from scratch. That is what Mozilla's DeepSpeech team has been doing since the beginning (you can read the above hacks blog post from Reuben Morais for more context there). |
It shouldn't be that hard to degrade the quality synthetically? And with a clean source you can synthesize different types of noise/distortions.
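To make the synthetic-degradation idea concrete, here is a rough sketch of the simplest case: mixing white Gaussian noise into a clean signal at a chosen signal-to-noise ratio. This is only an illustration, not what any particular ASR project does; the function names (`add_noise_at_snr`, `measured_snr_db`) are made up for this example, and a sine tone stands in for a clean audiobook clip. Real augmentation would also cover things like reverb, recorded background noise, and codec artifacts.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(s * s for s in samples) / len(samples)


def add_noise_at_snr(clean, snr_db, seed=None):
    """Return clean + white Gaussian noise scaled to hit snr_db exactly.

    Hypothetical helper for illustration; assumes `clean` is a non-silent
    list of float samples.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for P_noise.
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise component."""
    residual = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))


# Stand-in for a clean recording: one second of a 440 Hz tone at 16 kHz.
clean = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_noise_at_snr(clean, snr_db=10, seed=0)
```

Sweeping `snr_db` over a range (say 0 to 30 dB) would give several degraded copies of each clean utterance. White noise is the easy part, though; matching the microphone and room characteristics of real device audio is harder to fake.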