by magicalhippo
164 days ago
With LLMs I've seen zero-shot used to describe scenarios where there's no example, i.e. "take this and output JSON", while one-shot has the prompt include an example, like "take this and output JSON; for this data the JSON should look like this". Thus if you feed the model a target voice, i.e. an example of the desired output voice, it sure seems like it should be classified as one-shot. However, it seems the zero-shot in voice cloning is relative to learning, in contrast to one-shot learning[1]. So it's a somewhat overloaded term causing confusion, from what I can gather.
[1]: https://en.wikipedia.org/wiki/One-shot_learning_(computer_vi...
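To make the prompting sense of the terms concrete, here's a rough sketch of the two prompt shapes. The function names, wording and sample data are made up for illustration, and nothing here calls a real LLM API.

```python
import json


def zero_shot_prompt(data: str) -> str:
    # Zero-shot in the prompting sense: an instruction only, no worked example.
    return f"Take this and output JSON:\n{data}"


def one_shot_prompt(data: str, example_input: str, example_output: dict) -> str:
    # One-shot in the prompting sense: the same instruction plus exactly
    # one input/output example placed in the prompt.
    return (
        "Take this and output JSON.\n"
        f"For this data:\n{example_input}\n"
        f"the JSON should look like this:\n{json.dumps(example_output)}\n"
        f"Now do the same for:\n{data}"
    )


# Hypothetical sample data, only to show the difference in shape.
zs = zero_shot_prompt("name: Ada, age: 36")
os_ = one_shot_prompt(
    "name: Ada, age: 36",
    example_input="name: Bob, age: 41",
    example_output={"name": "Bob", "age": 41},
)
```

In both cases the model's weights are untouched; the only difference is what goes into the prompt.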
|
In voice cloning, the reference audio is simply the input, not a training example. You wouldn't say an image classifier is doing "one-shot learning" just because you fed it one image to classify. That image is the input. Similarly, the reference audio is the input that conditions the generation. It is zero-shot because the model's weights were never optimized for that specific speaker's manifold.
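A toy sketch of that distinction, if it helps. This is not a real TTS model: `ToyVoiceModel`, its three weights and the feature vectors are all hypothetical stand-ins. The point is only that conditioning on a reference leaves the weights alone, while learning from an example changes them.

```python
from typing import List


class ToyVoiceModel:
    def __init__(self) -> None:
        # Stands in for pretrained parameters (hypothetical values).
        self.weights: List[float] = [0.5, -0.2, 0.1]

    def generate(self, text_feats: List[float], speaker_ref: List[float]) -> List[float]:
        # Zero-shot cloning: the reference is just another input to the
        # forward pass. No parameter is updated here.
        return [w * t + r for w, t, r in zip(self.weights, text_feats, speaker_ref)]

    def fine_tune_step(
        self,
        text_feats: List[float],
        target: List[float],
        speaker_ref: List[float],
        lr: float = 0.1,
    ) -> None:
        # Learning from an example: one gradient step on squared error,
        # which does modify the weights.
        out = self.generate(text_feats, speaker_ref)
        grads = [2 * (o - y) * t for o, y, t in zip(out, target, text_feats)]
        self.weights = [w - lr * g for w, g in zip(self.weights, grads)]


model = ToyVoiceModel()
before = list(model.weights)

# Conditioning on a reference: weights stay exactly as they were.
audio = model.generate([1.0, 2.0, 3.0], speaker_ref=[0.1, 0.1, 0.1])
unchanged_after_generate = model.weights == before

# Training on an example of the speaker: weights move.
model.fine_tune_step([1.0, 2.0, 3.0], target=[0.0, 0.0, 0.0], speaker_ref=[0.1, 0.1, 0.1])
changed_after_fine_tune = model.weights != before
```

Under that framing, "zero-shot" counts training examples of the target speaker, not inputs at inference time.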