by rryan 1143 days ago
This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech).
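
Roughly, the pattern looks like the toy sketch below. This is not the project's actual code: every function here is a hypothetical stub standing in for a pre-trained model, and the "audio" is just bytes holding its own transcript so the example runs without any model downloads.

```python
# Hypothetical stand-ins for pre-trained models (assumptions, not a real API).
def speech_to_text(audio: bytes) -> str:
    # A real system would run an ASR model; here the bytes are the transcript.
    return audio.decode("utf-8")


def text_to_speech(text: str) -> bytes:
    # A real system would run a TTS model; here we just encode the text.
    return text.encode("utf-8")


def chat_model(prompt: str) -> str:
    # Placeholder for the text-only chat model sitting in the middle.
    return f"echo: {prompt}"


def handle(payload, modality: str):
    """Route a request: convert the input to text, query the chat model,
    then convert the reply back to the caller's modality."""
    to_text = {"audio": speech_to_text, "text": lambda t: t}
    from_text = {"audio": text_to_speech, "text": lambda t: t}
    if modality not in to_text:
        raise ValueError(f"unsupported modality: {modality}")
    prompt = to_text[modality](payload)   # modality transformation in
    reply = chat_model(prompt)            # the model only ever sees text
    return from_text[modality](reply)     # modality transformation out
```

The point of the sketch is that the chat model never touches audio; everything passes through text on the way in and out.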

I expected it to be a GPT-style model that processes audio directly to perform a ton of speech and maybe speech-text tasks in a zero-shot manner.

3 comments

Take a look at AudioLDM (https://github.com/haoheliu/AudioLDM), it might be more what you expected:

- Text-to-Audio Generation: Generate audio given text input.

- Audio-to-Audio Generation: Given an audio clip, generate another clip that contains the same type of sound.

- Text-guided Audio-to-Audio Style Transfer: Transfer the sound of one audio clip into another using a text description.

so then the training data is text, not audio?
you might be interested in suno-ai/bark