|
We're used to hearing some kind of identity behind voices -- we unconsciously sense clusters of vocabulary, intonation patterns, tics, frequent interruption vs. quiet patience, silence tolerance, response patterns to various triggers, etc. that communicate a coherent person of some kind. We may not know that a given speaker is a GenX Methodist from Wisconsin who grew up at skate parks in the suburbs, but we hear clusters of speech behavior that let our brain go "yeah, I'm used to things fitting together in this way sometimes." These voices don't have that. Instead, they seem to mostly smudge together behaviors that are just generally common in aggregate across the training data. The speakers all eagerly voice interrupting acknowledgements, they all use a bright and enunciated podcaster tone, they all draw on similar word choice, etc. -- they distinguish gender and each have a stable overall vocal tone, but no identity. I don't doubt that this'll improve quickly though, by training specific "AI celebrity" voices narrowed to sound more coherent, natural, identifiable, and consistent. (And then, probably, leasing out those voices for $$$.) As a tech demo for "render some vague sense of life behind this generated dialog" this is pretty good, though. |