	by rryan 1190 days ago
	This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech). I expected a GPT-style model that processes audio directly to perform a ton of speech and maybe speech-text tasks in a zero-shot manner.

3 comments

Take a look at AudioLDM (https://github.com/haoheliu/AudioLDM), it might be more what you expected:

- Text-to-Audio Generation: Generate audio given text input.

- Audio-to-Audio Generation: Given an audio clip, generate another audio clip that contains the same type of sound.

- Text-guided Audio-to-Audio Style Transfer: Transfer the sound of one audio clip into another using a text description.

so then the training data is text, not audio?