by babakd
1263 days ago
The speaker prompt is the sample speaker's voice reading a random text; it is one of the inputs the model uses. The second column is the human speaker reading the text (ground truth). The next two columns are the baseline and VALL-E text-to-speech outputs, respectively, each given only the speaker prompt (the first column) and the text as input.