by gwern
995 days ago
No, that's easy. We had the equivalent of that in GANs many years ago. If you've never seen GAN editing, here's a quick video: https://www.youtube.com/watch?v=Z1-3JKDh0nI (Background: https://gwern.net/face#reversing-stylegan-to-control-modify-... ) You just classify the latents, and then you can edit them. These days, with pretrained models like CLIP, you don't necessarily even need a latent space: you can take a model which has been trained on sound/text descriptions, like AudioCLIP, prompt it with a text like "vocal fry", and skew the generated samples to maximize similarity with "vocal fry". Put a slider on how much weight that skewing gets, and now you have a slider control for adjusting properties of the AI's voice. If something like this doesn't exist yet, it's obvious how to do it. (Even the realtime problem is being solved by figuring out how to train diffusion models to do a GAN-like single pass: https://arxiv.org/abs/2309.06380 )
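To make "classify the latents, then edit" concrete, here is a minimal sketch on toy data. It assumes a 16-dimensional latent space and a made-up attribute axis; the names `attribute_direction` and `edit_latent` are hypothetical, and the difference-of-means direction stands in for whatever classifier you would actually fit. No real generator is involved.

```python
import numpy as np

def attribute_direction(latents, labels):
    """Unit vector from the mean of latents without the attribute to the mean of those with it."""
    latents = np.asarray(latents, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    direction = latents[labels].mean(axis=0) - latents[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)

def edit_latent(z, direction, slider):
    """Move latent z along the attribute direction; slider=0 leaves z unchanged."""
    return np.asarray(z, dtype=float) + slider * direction

# Toy setup (assumption): the attribute lives on axis 3 of the latent space.
rng = np.random.default_rng(0)
true_axis = np.zeros(16)
true_axis[3] = 1.0
latents = rng.normal(size=(2000, 16))
labels = latents @ true_axis > 0  # stand-in for classifier labels on generated samples

direction = attribute_direction(latents, labels)
z = rng.normal(size=16)
z_more = edit_latent(z, direction, 2.0)   # slider turned up
z_less = edit_latent(z, direction, -2.0)  # slider turned down
```

With a real model you would feed `z_more` or `z_less` back through the generator; the slider is just the scalar multiplying the direction.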
It’s a really different world now. I’ve got massive models running on my laptop thanks to Apple Silicon and its unified memory architecture, and the C++ ports of various diffusion image models and several families of large language models work well on my AMD GPU too. It’s so much easier to participate in the current generation of applied ML work without having to go out of my way to get hardware with specific ML support.