Cool! Does text-to-speech require AI, or is there any active work in non-AI methods? Which bits are the AI bits? Do "deep" methods substantially improve over whatever classical methods we might have had?
1. Does text-to-speech require AI?
This one is a bit tricky to answer since it requires defining "AI". AI as a moniker has been used to describe deep neural networks, search algorithms, expert systems and logic systems, particle filters, SVMs, etc. Almost all text-to-speech (TTS) systems are based on a combination of some of these machine-learning methods and digital signal processing (DSP), so I would say yeah, text-to-speech is exactly what AI describes, even if it doesn't resemble human-like thinking the way other AI applications do.
2. Is there any active work in non-AI methods?
This one again is a bit tricky for the same reason as before. However, there are a ton of pieces of the TTS pipeline that aren't AI in the current sense of the word (machine learning with neural networks or HMMs or other classifiers). For example, concatenative systems will traditionally take a large database of audio, divide it into chunks, and then recombine those chunks, using some overlap-add method such as OLA or PSOLA to smooth the joins between them. Choosing which chunks to use to create the target speech becomes an AI / search problem: some sort of acoustic model predicts the acoustic parameters of each frame, and then a Viterbi search with target / join costs finds the optimal sequence of chunks. As another example of non-AI parts of the pipeline, text normalization tends to involve a lot of hand-written rules; for example, should you say "5/10/2019" as "May tenth, twenty nineteen", "the tenth of May twenty nineteen", "the tenth of May two thousand nineteen", or even "October fifth twenty nineteen"? This decision and the conversion are often done with a ton of hand-written rules or grammars (see Kestrel, Google's text normalization system, and the open-source version, cleverly named Sparrowhawk). Anyways, the real answer is that TTS is always a combination of AI (machine learning) approaches with specialized text and audio processing algorithms.
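To make the target / join cost search a bit more concrete, here's a minimal sketch of the Viterbi step. It's a toy under heavy assumptions: each chunk is reduced to a single number (think F0 in Hz), the target cost is just the distance to the predicted value, and the join cost is the distance between neighbouring chunks. The name `select_units` and the `join_weight` parameter are made up for illustration; real systems use much richer features and costs.

```python
from typing import List, Sequence, Tuple


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    join_weight: float = 1.0,
) -> Tuple[List[int], float]:
    """Pick one candidate per position, minimizing target cost + join cost.

    targets[t]    -- value the acoustic model predicted for position t
    candidates[t] -- values of the database chunks available at position t
    Returns (index of the chosen candidate at each position, total cost).
    """
    if not targets:
        return [], 0.0

    # best[t][j]: cheapest total cost of any path ending in candidate j at t
    best = [[abs(c - targets[0]) for c in candidates[0]]]
    back = []  # back[t-1][j]: best predecessor of candidate j at position t

    for t in range(1, len(targets)):
        row, ptr = [], []
        for c in candidates[t]:
            target_cost = abs(c - targets[t])
            # Cost of arriving at c from each candidate at the previous position.
            arrive = [
                best[t - 1][i] + join_weight * abs(prev - c)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(arrive)), key=arrive.__getitem__)
            row.append(arrive[i_min] + target_cost)
            ptr.append(i_min)
        best.append(row)
        back.append(ptr)

    # Trace the cheapest path back from the last position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path, total
```

With `targets=[100, 200]`, `candidates=[[100, 150], [200, 160]]` and `join_weight=2`, this returns `([1, 1], 110)`: the search skips the chunks that match the targets exactly, because joining them would be too rough. That trade-off is the whole point of having a join cost.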
3. Which bits are the AI bits?
The AI bits are the bits where you need to make some sort of heuristic decision, and you'd like to make it by imitating some target speech. For example: part-of-speech tagging, predicting acoustic parameters (spectrograms, F0, etc.), and more recently waveform synthesis as well.
4. Do deep methods significantly improve on the state of the art?
Yes, though they also come at a cost. For example, deep sequence-to-sequence networks make great frame-level models: Tacotron and similar models can do things like emotional and stylized voice synthesis much better than what I've seen HMMs and other non-deep models do. As another example, WaveNet, WaveRNN, etc. are some of the only parametric speech models (that is, models that generate the waveform from scratch instead of copying it from a database of audio) that can match the quality of concatenative models, but they can be quite difficult to deploy due to high computational cost. Overall, though, yeah, deep methods and all the improvements to neural networks in the past few years are having a profound impact on the quality and naturalness of TTS.
Thanks very much for your reply, super helpful!! Sorry if that was difficult to answer. I guess I'm interested in how far we've gone from TTS engines like the LPC [1] engines we had in the 80s, or what you get from festival [2]. Maybe there isn't as clear a separation between their methods and the modern Google-scale deep-learning approaches as I thought.
There are actually a few recent papers that show minor improvements from integrating LPC prediction into deep methods ([0], [1]). In my experience (some of which comes from reproducing these, some of which comes from my own experiments), this isn't actually too useful, and at most offers a minor modeling benefit.
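For anyone who hasn't seen LPC before: the idea is to predict each sample as a weighted sum of the previous few samples, so only the (small) prediction error is left to model. Here's a rough sketch of the classic autocorrelation method with the Levinson-Durbin recursion, in plain Python. This is the textbook version, not what the papers above do, and the function name `lpc` is just my choice for the example.

```python
from typing import List, Sequence, Tuple


def lpc(signal: Sequence[float], order: int) -> Tuple[List[float], float]:
    """Estimate predictor coefficients a[1..order] such that
    signal[n] ~= sum(a[k] * signal[n - k] for k in 1..order).

    Returns (coefficients, residual energy). Assumes len(signal) > order.
    """
    n = len(signal)
    # Autocorrelation at lags 0..order.
    r = [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(order + 1)
    ]

    a = [0.0] * (order + 1)  # a[0] is unused, kept so indices match the math
    err = r[0]               # prediction error energy with no predictor at all

    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame; nothing left to fit
        # Reflection coefficient for this order.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k

    return a[1:], err
```

As a sanity check, a decaying exponential like `[0.9 ** n for n in range(200)]` satisfies `x[n] = 0.9 * x[n-1]` exactly, so an order-2 fit should come back with a first coefficient very close to 0.9 and a second very close to 0.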
The main difference between something like Festival and what we have now is the amount of domain-specific engineering. (This is generally the promise of deep learning -- replace hand-engineered features with simple-to-understand features and a deep model.) If you go and read the Festival manual, you're going to find tons of domain-specific rules and heuristics and subroutines; for example, there's a page on writing letter-to-sound rules as a grammar [2]. Nowadays, we may have a pipeline that resembles Festival at the high level, but each step of the pipeline is learned as a deep model from data rather than being carefully hand-engineered by many people over the course of years. This yields much more fluid speech as well as much, much faster iteration and experimentation, leading to faster progress as well.
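To give a flavour of what hand-written letter-to-sound rules look like, here's a toy sketch. It is only loosely in the spirit of context-sensitive rewrite rules and is not Festival's actual rule syntax; the rule table, the phone names and the function `letter_to_sound` are all invented for this example and cover just a handful of letters.

```python
from typing import List, Tuple

# Each rule: (left context, letters, right context, phones).
# An empty context matches anything. Rules are tried in order; first match wins.
Rule = Tuple[str, str, str, List[str]]

RULES: List[Rule] = [
    ("", "c", "e", ["s"]),   # "c" before "e" is soft: "cent"
    ("", "c", "i", ["s"]),   # "c" before "i" is soft: "cite"
    ("", "c", "", ["k"]),    # otherwise "c" is hard: "cat"
    ("", "a", "", ["ae"]),
    ("", "e", "", ["eh"]),
    ("", "i", "", ["ih"]),
    ("", "n", "", ["n"]),
    ("", "t", "", ["t"]),
]


def letter_to_sound(word: str) -> List[str]:
    """Convert a lowercase word to phones by scanning left to right."""
    phones: List[str] = []
    pos = 0
    while pos < len(word):
        for left, letters, right, out in RULES:
            end = pos + len(letters)
            if (
                word[pos:end] == letters
                and word[:pos].endswith(left)
                and word[end:].startswith(right)
            ):
                phones.extend(out)
                pos = end
                break
        else:
            # No rule matched: a real system needs full coverage or a fallback.
            raise ValueError(f"no rule for {word[pos]!r} in {word!r}")
    return phones
```

Here `letter_to_sound("cent")` gives `["s", "eh", "n", "t"]` and `letter_to_sound("cat")` gives `["k", "ae", "t"]`. Now imagine maintaining thousands of these by hand, with all the exceptions English has, and you can see why learning the mapping from data is appealing.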