
Human speech production/perception works by articulation changing the shape, and hence the resonant frequencies (formants), of the vocal tract, with our ear/auditory cortex then picking up these changing formants. We're especially attuned to changes in the formants, since those correspond to changes in articulation. The specific resonant frequency values of the formants vary from individual to individual and aren't so important.

Similarly, the sound source (aka voice) for human speech can vary a lot from individual to individual, so it serves more to communicate age/sex, emotion, identity, etc. than the actual speech content (which is carried by the formant changes).
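To make the source/filter split concrete, here's a rough Python sketch, not anyone's production synthesizer: an impulse train stands in for the voice source, and a cascade of two-pole resonators (the recurrence commonly used in Klatt-style formant synthesizers) stands in for the vocal tract. The function names and the formant/bandwidth numbers are my own illustrative assumptions, not measurements.

```python
import math

def resonate(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2], unity gain at DC."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude voice source: one unit impulse per pitch period."""
    period = int(round(sample_rate / pitch_hz))
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(pitch_hz, formants, duration_s=0.3, sample_rate=16000):
    """Pass the source through one resonator per (freq_hz, bandwidth_hz) formant."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonate(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal

def harmonic_level(signal, target_hz, pitch_hz, sample_rate=16000):
    """Magnitude of the pitch harmonic nearest target_hz (single-bin DFT)."""
    freq = max(1, round(target_hz / pitch_hz)) * pitch_hz
    w = 2.0 * math.pi * freq / sample_rate
    re = sum(x * math.cos(w * i) for i, x in enumerate(signal))
    im = sum(x * math.sin(w * i) for i, x in enumerate(signal))
    return math.hypot(re, im) / len(signal)

# Assumed, roughly "ah"-like formants as (frequency, bandwidth) in Hz.
AH_FORMANTS = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]

low_voice = synthesize_vowel(100.0, AH_FORMANTS)   # lower-pitched source
high_voice = synthesize_vowel(200.0, AH_FORMANTS)  # higher-pitched source
```

Changing `pitch_hz` changes which harmonics exist, but in both outputs the harmonics near 700 Hz stay much stronger than those in the valley around 2000 Hz; you can check that with `harmonic_level`. The spectral envelope (the formants) comes from the filter, not the source.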

The reason articulatory synthesis (whether based on a physical model of the vocal tract, or a software simulation of one) and formant synthesis sound so similar is that both are designed to emphasize the formants (resonant frequencies) in a somewhat overly precise way, and neither typically does a good job of accurately modelling the voice source or the other factors that would make the result sound more natural. The most extreme form of formant synthesis just uses sine waves (not a source + filter model) to model the changing formant frequencies, and is still quite intelligible.
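For the sine-wave case, a minimal sketch of the idea looks something like this: one sinusoid per formant, each following a time-varying frequency track, with no source or filter at all. The keyframe format, function names, amplitudes, and formant values are all assumptions of mine for illustration (a rough "ee" to "ah" glide), not data taken from a real utterance.

```python
import math

def interpolate_track(keyframes, num_samples, sample_rate):
    """Linearly interpolate (time_s, freq_hz) keyframes to one frequency per sample.

    Needs at least two keyframes, sorted by time.
    """
    freqs = []
    k = 0
    for i in range(num_samples):
        t = i / sample_rate
        # Advance to the keyframe segment that contains time t.
        while k + 1 < len(keyframes) - 1 and t >= keyframes[k + 1][0]:
            k += 1
        (t0, f0), (t1, f1) = keyframes[k], keyframes[k + 1]
        frac = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        freqs.append(f0 + frac * (f1 - f0))
    return freqs

def sine_wave_speech(tracks, duration_s, sample_rate=16000):
    """Sum one sinusoid per formant track; tracks is a list of (keyframes, amplitude)."""
    n = int(duration_s * sample_rate)
    out = [0.0] * n
    for keyframes, amplitude in tracks:
        phase = 0.0
        for i, freq in enumerate(interpolate_track(keyframes, n, sample_rate)):
            out[i] += amplitude * math.sin(phase)
            # Accumulate phase so the tone stays continuous as the frequency glides.
            phase += 2.0 * math.pi * freq / sample_rate
    return out

# Assumed formant glide over half a second, values in Hz.
GLIDE = [
    ([(0.0, 270.0), (0.5, 730.0)], 1.0),     # F1 rises
    ([(0.0, 2290.0), (0.5, 1090.0)], 0.5),   # F2 falls
    ([(0.0, 3010.0), (0.5, 2440.0)], 0.25),  # F3 falls
]

samples = sine_wave_speech(GLIDE, duration_s=0.5)
```

The samples can be scaled to 16-bit and written out with Python's standard `wave` module if you want to listen. It won't sound like a voice, but the moving "formants" are the only information in the signal, which is the point.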

The "Daisy" song somehow became a staple for computer speech, and can be heard here in the 1984 DECtalk formant-synthesizer version. You can still pick up DECtalks on eBay: an impressively large, VCR-sized box with a 3" 68000 processor inside.

https://en.wikipedia.org/wiki/Daisy_Bell