|
I'll try to answer these one at a time.

1. Does text-to-speech require AI?

This one is a bit tricky to answer since it requires defining "AI". The term has been used to describe deep neural networks, search algorithms, expert systems and logic systems, particle filters, SVMs, and so on. Almost all text-to-speech (TTS) systems are built from a combination of some of these machine-learning methods and digital signal processing (DSP), so I would say yeah, text-to-speech is exactly what AI describes, even if it doesn't resemble human-like thinking the way other AI applications do.

2. Is there any active work in non-AI methods?

This one is tricky for the same reason as before. However, plenty of pieces of the TTS pipeline aren't AI in the current sense of the word (machine learning with neural networks, HMMs, or other classifiers). For example, concatenative systems traditionally take a large database of audio, divide it into chunks, and then recombine those chunks, using an overlap-add method such as OLA or PSOLA to blend them at the joins. Choosing which chunks to use for the target speech, though, becomes an AI / search problem: some sort of acoustic model predicts the acoustic parameters of each frame, and a Viterbi search with target and join costs then finds the optimal sequence of chunks.

As another example of non-AI parts of the pipeline, text normalization tends to involve a lot of hand-written rules. For example, should you say "5/10/2019" as "May tenth, twenty nineteen", "the tenth of May twenty nineteen", "the tenth of May two thousand nineteen", or even "October fifth twenty nineteen"? Both the decision and the conversion are often handled by a ton of handwritten rules or grammars (see Kestrel, Google's text normalization system, and the open-source version, cleverly named Sparrowhawk).
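To make the unit-selection search mentioned above a bit more concrete, here is a minimal sketch of a Viterbi pass over candidate chunks. Everything in it is an assumption for illustration: the function name `select_units`, the idea that each target position already has a short list of candidate units, and the two cost callbacks. Real systems use far richer costs and prune the candidate lists.

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")  # a "unit" can be anything: an index, a feature vector, ...


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> List[U]:
    """Pick one candidate per position, minimizing target + join costs.

    candidates[t] lists the database units that could fill position t.
    target_cost(t, u) scores how badly unit u fits position t.
    join_cost(a, b) scores how badly unit b follows unit a.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j]: cheapest total cost of a path ending in candidates[t][j]
    best = [target_cost(0, u) for u in candidates[0]]
    backptrs: List[List[int]] = []

    for t in range(1, len(candidates)):
        new_best: List[float] = []
        ptrs: List[int] = []
        for u in candidates[t]:
            # Cost of reaching u from each candidate at the previous position
            costs = [best[i] + join_cost(p, u)
                     for i, p in enumerate(candidates[t - 1])]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, u))
            ptrs.append(i_min)
        best = new_best
        backptrs.append(ptrs)

    # Backtrack from the cheapest final candidate
    j = min(range(len(best)), key=best.__getitem__)
    path = [j]
    for ptrs in reversed(backptrs):
        j = ptrs[j]
        path.append(j)
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]


# Toy usage: units are just pitch values in Hz (a made-up stand-in).
targets = [100.0, 100.0]
cands = [[100.0, 104.0], [110.0, 106.0]]
chosen = select_units(
    cands,
    target_cost=lambda t, u: abs(u - targets[t]),
    join_cost=lambda a, b: 2.0 * abs(a - b),
)
print(chosen)
```

In the toy usage, picking the best target match at each position independently would choose 100.0 and then 106.0, but the join cost makes the smoother pair 104.0, 106.0 cheaper overall, which is the whole point of searching over sequences instead of frames.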
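For a feel of what hand-written normalization rules look like, here is a small sketch that verbalizes the "5/10/2019" example. It is not how Kestrel or Sparrowhawk work (those are grammar-based); the names `verbalize_date`, `read_year`, and `ordinal`, and the `day_first` flag are all made up for this illustration, and it only handles `D/M/YYYY`-shaped strings with four-digit years.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
ORD_1_19 = ("first second third fourth fifth sixth seventh eighth ninth tenth "
            "eleventh twelfth thirteenth fourteenth fifteenth sixteenth "
            "seventeenth eighteenth nineteenth").split()
DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")


def two_digits(n: int) -> str:
    """Spell out 0..99, e.g. 84 -> 'eighty-four'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def ordinal(n: int) -> str:
    """Spell out a day of the month (1..31) as an ordinal."""
    if n < 20:
        return ORD_1_19[n - 1]
    if n in (20, 30):
        return "twentieth" if n == 20 else "thirtieth"
    tens, ones = divmod(n, 10)
    return f"{TENS[tens]}-{ORD_1_19[ones - 1]}"


def read_year(year: int) -> str:
    """One possible reading of a four-digit year (a rule choice, not the only one)."""
    if not 1000 <= year <= 9999:
        raise ValueError("expected a four-digit year")
    high, low = divmod(year, 100)
    if high % 10 == 0 and low < 10:       # 2000 -> 'two thousand', 2005 -> '... five'
        head = f"{ONES[high // 10]} thousand"
        return head if low == 0 else f"{head} {ONES[low]}"
    if low == 0:                          # 1900 -> 'nineteen hundred'
        return f"{two_digits(high)} hundred"
    if low < 10:                          # 1905 -> 'nineteen oh five'
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"   # 2019 -> 'twenty nineteen'


def verbalize_date(text: str, day_first: bool = False) -> str:
    """Verbalize 'A/B/YYYY'; day_first decides which field is the day."""
    m = DATE_RE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a recognized date: {text!r}")
    a, b, year = (int(g) for g in m.groups())
    day, month = (a, b) if day_first else (b, a)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError(f"month or day out of range: {text!r}")
    if day_first:
        return f"the {ordinal(day)} of {MONTHS[month - 1]} {read_year(year)}"
    return f"{MONTHS[month - 1]} {ordinal(day)}, {read_year(year)}"


print(verbalize_date("5/10/2019"))
print(verbalize_date("5/10/2019", day_first=True))
```

Even this toy version has to hard-code a convention (month-first vs. day-first) and a year-reading style, which is exactly the kind of decision that ends up as a pile of locale-specific rules in a real system.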
Anyways, the real answer is that TTS is always a combination of AI (machine learning) approaches with specialized text and audio processing algorithms.

3. Which bits are the AI bits?

The AI bits are the ones where you need to make some sort of heuristic decision, and you'd like to make it by imitating some target speech: for example, part-of-speech detection, predicting acoustic parameters (spectrograms, F0, etc.), and more recently waveform synthesis as well.

4. Do deep methods significantly improve on the state of the art?

Yes, though they also come at a cost. For example, deep sequence-to-sequence networks make great frame-level models: Tacotron and similar models can do things like emotional and stylized voice synthesis much better than what I've seen HMMs and other non-deep models do. As another example, WaveNet, WaveRNN, and similar models are some of the only parametric speech models (that is, models that generate the waveform from scratch instead of copying it from a database of audio) that can match the quality of concatenative models, but they can be quite difficult to deploy due to high computational cost. Overall, though, yeah, deep methods and all the improvements to neural networks in the past few years are having a profound impact on the quality and naturalness of TTS. |