by InspiredIdiot 1152 days ago
I think the parent is saying that the model does not require any paired samples, i.e. audio of the voice to be synthesized together with the corresponding text. So, based on my understanding:

one shot - given the text "run faster" along with Alan Greenspan's voice pronouncing that phrase, the model can produce Alan Greenspan's voice saying any other phrase

zero shot - given only Alan Greenspan's voice pronouncing "run faster" but no text version of what was said, the model can produce Alan Greenspan's voice saying any other phrase
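
To make the distinction concrete, here is a minimal Python sketch, assuming the reading above is right. `VoicePrompt` and `shot_setting` are made-up names for illustration, not part of any real TTS library; the only point is that the two settings differ in whether a transcript accompanies the reference audio.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoicePrompt:
    """Hypothetical container for what the model is given about the target speaker."""
    audio: bytes                      # recording of the speaker, e.g. saying "run faster"
    transcript: Optional[str] = None  # text of what was said, if it is available


def shot_setting(prompt: VoicePrompt) -> str:
    """Label the setting using the definitions in this comment (an assumption, not a standard)."""
    # A paired (audio, text) sample counts as one "shot"; audio alone counts as zero.
    return "one-shot" if prompt.transcript is not None else "zero-shot"


paired = VoicePrompt(audio=b"<placeholder audio bytes>", transcript="run faster")
audio_only = VoicePrompt(audio=b"<placeholder audio bytes>")

print(shot_setting(paired))      # one-shot
print(shot_setting(audio_only))  # zero-shot
```

In both cases the model would then be asked to say some other phrase in that voice; the sketch only covers what it is handed up front.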

1 comment

Does that mean a shot is text?