by programjames 598 days ago
My guess:

1. `codec`: First, compress 16 kHz audio down to 8 samples per second with convolutions. Then vector-quantize each one to 128 bits (probably 8 floats) to get a codec. That is nowhere near enough bits to represent the audio itself; it's more to represent phonemes.

2. `vae`: This looks like a VAE-based diffusion model that uses the codec output as its prompt.

3. `dev`: This is a next-codec prediction model.
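To put rough numbers on the codec guess in item 1, here is a back-of-the-envelope calculation. The sample rate, frame rate, bit budget, and float count come from the comment; the 16-bit PCM input depth is an assumption added only to get a comparison point.

```python
# Figures taken from the guess above.
SAMPLE_RATE_HZ = 16_000   # 16 kHz input audio
FRAMES_PER_SEC = 8        # 8 compressed samples (frames) per second
BITS_PER_FRAME = 128      # bits per vector-quantized frame
FLOATS_PER_FRAME = 8      # "probably 8 floats"

# Assumption (not stated in the comment): the input is 16-bit PCM.
PCM_BITS_PER_SAMPLE = 16

# How many raw audio samples each compressed frame has to summarize.
downsample_factor = SAMPLE_RATE_HZ // FRAMES_PER_SEC

# 128 bits split over 8 floats implies half-precision-sized values.
bits_per_float = BITS_PER_FRAME // FLOATS_PER_FRAME

# Bitrate of the codec stream versus the raw PCM stream.
codec_bitrate = FRAMES_PER_SEC * BITS_PER_FRAME
pcm_bitrate = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = pcm_bitrate / codec_bitrate

print(downsample_factor, bits_per_float, codec_bitrate, compression_ratio)
```

Under these assumptions each frame covers 2000 raw samples and the stream carries about 1 kbit/s, which fits the point that it can only hold something phoneme-like rather than the waveform.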

Put together, it probably runs like so:

1. Turn your prompt into tokens with the `codec`.

2. If you want `s` more seconds of audio, use `dev` to predict `8 * s` more tokens.

3. Turn it back into audio with the `vae` diffusion model.
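The three steps above can be sketched as a toy pipeline. Every function here is a hypothetical stand-in named after the guessed component, not the actual model code: the "codec" just averages each block of samples, "dev" repeats the last frame, and the "vae" stretches frames back out. It only illustrates the data flow and the `8 * s` bookkeeping.

```python
SAMPLE_RATE_HZ = 16_000                  # from the guess above
FRAMES_PER_SEC = 8                       # from the guess above
HOP = SAMPLE_RATE_HZ // FRAMES_PER_SEC   # raw samples per frame (2000)


def codec_encode(audio):
    """Hypothetical stand-in for `codec`: one value per HOP-sample block."""
    blocks = [audio[i:i + HOP] for i in range(0, len(audio) - HOP + 1, HOP)]
    return [sum(block) / HOP for block in blocks]


def dev_predict(frames, n_new):
    """Hypothetical stand-in for `dev`: append n_new frames one at a time."""
    out = list(frames)
    for _ in range(n_new):
        out.append(out[-1])  # placeholder prediction: repeat the last frame
    return out


def vae_decode(frames):
    """Hypothetical stand-in for `vae`: expand each frame to HOP samples."""
    return [value for value in frames for _ in range(HOP)]


def continue_audio(prompt_audio, extra_seconds):
    frames = codec_encode(prompt_audio)                            # step 1
    frames = dev_predict(frames, FRAMES_PER_SEC * extra_seconds)   # step 2
    return vae_decode(frames)                                      # step 3


prompt = [0.0] * SAMPLE_RATE_HZ          # one second of silence as a prompt
audio = continue_audio(prompt, extra_seconds=2)
print(len(audio) / SAMPLE_RATE_HZ)       # total duration in seconds
```

A one-second prompt plus two requested seconds yields 8 + 16 frames, so the output covers three seconds of audio.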

1 comments

I don't actually see any tokens used in the model. It seems like the model actually predicts latents, and then the VAE converts them back to audio. More like Tortoise or XTTS.