> Interesting that there isn't a mention of Orpheus as prior art either

Llasa-3b (https://huggingface.co/HKUSTAudio/Llasa-3B) came out before Orpheus (https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).

> it's the exact same thing.

They're very similar, but they're not the exact same thing. Llasa uses xcodec2, a much simpler, lossless 16 kHz wav codec. This makes it superior for one-shot voice cloning. Orpheus' 24 kHz SNAC codec is lossy, which makes it difficult to use for zero-shot cloning because the reference audio gets degraded during tokenization. You can test this here: https://huggingface.co/spaces/Gapeleon/snac_test

But when finetuned on 50+ audio samples, Orpheus produces much cleaner 24 kHz audio than Llasa, and the SNAC model is much easier to run on consumer hardware than xcodec2 (realtime speech needs 87 tokens/s, which can be achieved on an RTX 3080, for example).
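As a rough sanity check on the realtime figure, here is a minimal sketch of how you might compare your own measured generation speed against the 87 tokens/s quoted above. The function name and the idea of timing a generation run yourself are assumptions for illustration, not part of Orpheus or SNAC.

```python
# Required token rate for realtime speech, as quoted in the comment above.
REALTIME_TOKENS_PER_SEC = 87.0


def realtime_factor(generated_tokens: int, elapsed_sec: float,
                    required_tps: float = REALTIME_TOKENS_PER_SEC) -> float:
    """Return measured tokens/s divided by the rate needed for realtime.

    A value >= 1.0 means generation keeps up with playback;
    a value below 1.0 means audio would stall.
    """
    if elapsed_sec <= 0:
        raise ValueError("elapsed_sec must be positive")
    measured_tps = generated_tokens / elapsed_sec
    return measured_tps / required_tps


if __name__ == "__main__":
    # Hypothetical measurement: 1000 tokens generated in 10 seconds.
    factor = realtime_factor(1000, 10.0)
    print(f"realtime factor: {factor:.2f}")
```

Anything comfortably above 1.0 leaves headroom for decoding the tokens back to audio, which this sketch does not account for.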
Zonos uses 128-float embeddings for voices, and that seems so much nicer because you can just mix and match voices without changing the model.
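To illustrate what "mix and match" could look like, here is a minimal sketch that blends fixed-size voice embeddings with a weighted average. Only the 128-float size comes from the comment; the `blend_voices` helper and the assumption that a linear blend yields a usable voice are hypothetical, and this is not Zonos's actual API.

```python
EMBED_DIM = 128  # embedding size mentioned in the comment


def blend_voices(embeddings, weights):
    """Weighted average of equally sized voice embeddings (hypothetical helper).

    embeddings: list of lists of floats, all the same length
    weights:    one non-negative weight per embedding
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the weight total so the result stays on the same scale.
    return [
        sum(w * e[i] for w, e in zip(weights, embeddings)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Placeholder vectors standing in for two real speaker embeddings.
    voice_a = [1.0] * EMBED_DIM
    voice_b = [0.0] * EMBED_DIM
    mixed = blend_voices([voice_a, voice_b], [3, 1])  # 75% A, 25% B
    print(len(mixed), mixed[0])
```

The appeal is that the blended vector is just another conditioning input, so the model weights stay untouched.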