by ks2048
259 days ago
Every couple of weeks I see a new TTS model showcased here, and it’s always difficult to tell how they differ from one another. Why don’t they describe the architecture and the details of the training data? My cynical side thinks people just take the state-of-the-art open-source model, use an LLM to alter the source, do some minimal fine-tuning to change the weights, and then claim “we built our own state-of-the-art TTS”. I know it’s open source, so I can dig into the details myself, but are there any good high-level overviews of modern TTS that compare and contrast the top models?
Architecturally it's similar to other LLM-based TTS models (like OuteTTS), but the choice of underlying LLM is what lets them release it under an Apache 2 license.