by rryan
1143 days ago
This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech). I expected a GPT-style model that processes audio directly and performs a ton of speech, and maybe speech-text, tasks in a zero-shot manner.
- Text-to-Audio Generation: Generate audio given text input.
- Audio-to-Audio Generation: Given an audio clip, generate another that contains the same type of sound.
- Text-guided Audio-to-Audio Style Transfer: Transfer the sound of one audio clip into another using a text description.
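
To make the "router plus modality transformations" shape concrete, here is a minimal sketch of what I mean. Everything in it is hypothetical: the function names, the task table, and the keyword-based `route` are my own stand-ins, not the project's actual code, and the "models" are trivial stubs where real pre-trained models would sit.

```python
from typing import Callable, Dict

# Hypothetical stubs standing in for separate pre-trained models.
def speech_to_text(audio: bytes) -> str:
    # Pretend the audio bytes already are the transcript.
    return audio.decode("utf-8")

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")

def text_to_audio(prompt: str) -> bytes:
    return f"<audio:{prompt}>".encode("utf-8")

# Task table: each entry maps a task name to one specialist model.
TASKS: Dict[str, Callable[[str], bytes]] = {
    "generate": text_to_audio,
    "say": text_to_speech,
}

def route(request: str) -> str:
    # A keyword check stands in for the LLM deciding which task to run.
    return "generate" if request.lower().startswith("generate") else "say"

def handle(audio_in: bytes) -> bytes:
    text = speech_to_text(audio_in)   # modality transformation: audio -> text
    task = route(text)                # router picks a specialist model
    return TASKS[task](text)          # the specialist produces the output audio
```

The point of the sketch is that the language model only ever sees text. Audio is converted at the edges, which is quite different from a single model consuming audio directly.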