by waynecolvin 4387 days ago
Question: Can anybody explain how a synthetic singing voice is made? Especially things like pitch: a voice actor doesn't have to sing every syllable at every pitch, do they?
4 comments

There are several approaches. I'm not sure what this software uses, but a machine translation of the site suggests samples.

Pitch shifting samples is one way to do it. A singer is recorded singing a syllable and that is shifted up or down by software. Artifacts creep in relatively quickly, especially with something as nuanced as the human voice. A variety of pitches and syllables can be sampled and the pitch shifting manually tuned to minimize audible artifacts.
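To make the pitch-shifting idea concrete, here is a minimal sketch of the crudest version: resampling with linear interpolation. It is an illustration under my own assumptions, not what any particular product does, and the function names are made up. Naive resampling changes the duration and moves the formants along with the pitch, which is one reason artifacts show up so quickly.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def resample_pitch_shift(samples, semitones):
    """Shift pitch by reading the sample faster or slower (hypothetical helper).

    Reading faster raises the pitch but also shortens the sound.
    """
    ratio = semitone_ratio(semitones)
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio              # fractional read position in the input
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(a + (b - a) * frac)   # linear interpolation
    return out
```

Shifting up 12 semitones doubles the read speed and halves the length, so a practical tool would have to pair something like this with time-stretching.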

Modeling could also be used, from simplistic models not far removed from the ADSR envelopes of basic synthesis to advanced physically based models. Samples and modeling could be combined to expand the palette of syllables.
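For anyone who hasn't met ADSR (attack, decay, sustain, release) envelopes, this is roughly what the simplest end of that modeling spectrum looks like. It is a sketch with linear segments and lengths given in samples; the function name and parameters are my own choices, not taken from any synthesizer.

```python
def adsr(n, attack, decay, sustain_level, release):
    """Return n gain values in [0, 1] shaped as a linear ADSR envelope.

    attack, decay and release are segment lengths in samples (assumed units).
    """
    sustain_len = n - attack - decay - release
    if sustain_len < 0:
        raise ValueError("segments are longer than the envelope")
    env = []
    for i in range(attack):                      # ramp 0 -> 1
        env.append(i / attack)
    for i in range(decay):                       # fall 1 -> sustain_level
        env.append(1.0 - (1.0 - sustain_level) * i / decay)
    env.extend([sustain_level] * sustain_len)    # hold
    for i in range(release):                     # fade sustain_level -> 0
        env.append(sustain_level * (1.0 - i / release))
    return env
```

Multiplying an oscillator's output by these gains sample by sample gives a note a shaped onset and tail, which is still a long way from a convincing voice.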

Our ears and neural processing of speech and singing are finely tuned to subtle shades of difference, so any technique often sounds artificial. Fortunately this can be exploited musically, and great music can be made with these 'artificial' sources.

UTAU and Vocaloid do indeed use pitch-shifted samples (UTAU even lets you build your own sample libraries). A more recent product, CeVIO, uses modeling IIRC.
Such things do exist, with much more than single voice actors too: http://www.youtube.com/watch?v=oyijUC1g_yg

Voice "synthesizers" generally use specially developed algorithms. See: https://en.wikipedia.org/wiki/Speech_synthesis#Formant_synth...

It's a combination of several oscillators and a complex resonator model.

The oscillators include the vocal cords as well as a model of the lips for fricatives, plosives, etc.

The resonator includes several major tunable cavities, from the lungs to the trachea to the sinuses, nasal cavities and mouth. These resonators form filters called formants, which have default configurations for every vowel sound. However, they are highly customized for every singer and express a terrific degree of nuance. Synthesis requires a multi-dimensional score somewhat like a speech synthesizer. The score can be dimension-reduced, but it will sound like crap. I would expect it to take about as much time to enter the data as it would to learn to perform it.
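A stripped-down sketch of the oscillator-plus-resonator idea: an impulse train stands in for the vocal cords, and a cascade of two-pole resonators stands in for the cavities. The formant frequencies and bandwidths in the usage note are rough, illustrative values for an "ah"-like vowel, and all the names here are hypothetical. A real synthesizer would vary every one of these parameters over time.

```python
import math


def impulse_train(f0, n, sample_rate):
    """Very crude glottal source: one impulse per pitch period."""
    period = int(sample_rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def resonator(signal, freq, bandwidth, sample_rate):
    """Two-pole resonant filter centered near freq (one formant)."""
    r = math.exp(-math.pi * bandwidth / sample_rate)   # pole radius, < 1
    theta = 2.0 * math.pi * freq / sample_rate         # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synth_vowel(f0, formants, n, sample_rate=16000):
    """Run the source through one resonator per (freq, bandwidth) pair."""
    signal = impulse_train(f0, n, sample_rate)
    for freq, bandwidth in formants:
        signal = resonator(signal, freq, bandwidth, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]                   # normalize to [-1, 1]
```

Something like `synth_vowel(110, [(730, 80), (1090, 90), (2440, 120)], 16000)` gives one second of a buzzy, static vowel. Changing the formant list changes the vowel, and changing `f0` changes the pitch independently, which is the appeal over pitch-shifted samples.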

I've always liked this Vocaloid opera song: https://www.youtube.com/watch?v=VseHlKR4Ew8 (Voi che Sapete)

While the song is playing you can see the settings. I've played with it once but it's hard to get it right.

First you place the notes to set pitch and length. Then you attach phonetic codes to the notes. So it's not like you are adding words to notes. It's all about how it should sound. Then you can also add things like amplitude settings.
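As a rough picture of that workflow, a score could be modeled as a list of notes, each carrying a pitch, a length, a phonetic code and an amplitude. This is a hypothetical data structure of my own, not Vocaloid's actual file format, and the phoneme strings are made-up placeholders rather than its real phoneme set.

```python
from dataclasses import dataclass


@dataclass
class Note:
    midi_pitch: int        # 69 = A4
    beats: float           # length in beats
    phonemes: str          # how it should sound, not how it is spelled
    dynamics: float = 0.8  # amplitude setting, 0..1 (assumed scale)


def midi_to_hz(midi_pitch):
    """Equal-tempered frequency with A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12.0)


def render_plan(notes, tempo_bpm):
    """Turn notes into (start_s, duration_s, hz, phonemes, dynamics) tuples."""
    seconds_per_beat = 60.0 / tempo_bpm
    start = 0.0
    plan = []
    for note in notes:
        duration = note.beats * seconds_per_beat
        plan.append((start, duration, midi_to_hz(note.midi_pitch),
                     note.phonemes, note.dynamics))
        start += duration
    return plan
```

For example, `render_plan([Note(69, 1.0, "v o i"), Note(71, 0.5, "k e")], 120)` yields two timed events that a synthesis engine could then voice.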