by InspiredIdiot
1152 days ago
I think parent is saying that the model does not require any paired samples of the target voice and its corresponding text. So based on my understanding:

one shot - given the text "run faster" along with Alan Greenspan's voice pronouncing that phrase, the model can produce Alan Greenspan's voice saying any other phrase

zero shot - given only Alan Greenspan's voice pronouncing "run faster", with no text version of what was said, the model can produce Alan Greenspan's voice saying any other phrase
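To make the distinction concrete, here is a rough Python sketch of just the inputs, not of any real model. `VoiceReference`, `conditioning_setting`, and the file name `greenspan.wav` are all made up for illustration. The only point is that, under my reading, the two settings differ in whether a transcript accompanies the reference audio.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceReference:
    """Hypothetical container for the sample used to condition on a voice."""
    audio_path: str                   # recording of the target speaker
    transcript: Optional[str] = None  # text of what was said, if available


def conditioning_setting(ref: VoiceReference) -> str:
    """Label the setting using the distinction described above (my reading only)."""
    # Paired audio + text -> "one-shot"; audio alone -> "zero-shot".
    return "one-shot" if ref.transcript is not None else "zero-shot"


paired = VoiceReference("greenspan.wav", transcript="run faster")
audio_only = VoiceReference("greenspan.wav")

print(conditioning_setting(paired))
print(conditioning_setting(audio_only))
```

In both cases the phrase to be synthesized is a separate input; only the reference sample changes.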