|
|
|
|
|
by derefr
166 days ago
|
|
Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.) Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step. |
|
There are two inputs: audio A (“style”) and B (“content”). The timbre is taken from A, and the content, pronunciation, prosody, accent, etc is taken from B.
Strictly soeaking, voice cloning models like this and chatterbox are not “TTS” - they’re better thought of as “S+STS”, that is, speech+style to speech