Hacker News
by sumtechguy 996 days ago
What I find interesting is that eventually these companies will hire some college kids who need a couple thousand bucks and a free pizza. Have them read the right scripts, sign the right 'give everything away' contract, and then just use their voices forever. Or do it the sneaky way: ship a voice assistant and put 'we can use a copy of your voice for anything' in the ToS.

The existing voice actors will be just out of work. There will be a small cadre of groups that want real voice. But for some projects that will not be that important.

It's going to get crazy.


They don't need that - they already have enough data to generate plausibly human voices that don't sound like anyone in particular.

Voice cloning is a special case; these models are equally good at making new voices.

I’ve found it’s not actually as easy to get this stuff to sound different to the specific someone it’s trained on.
Don't expect that to last more than a year or two, assuming it's even still a problem for the best voice-generation AIs. Generating high-quality samples at all is the hard problem; generating specific high-quality samples is, by comparison, a lot easier.

Remember when Stable Diffusion was released a year ago and one of the big artist copes was "sure, it can generate random images, but it'll never be able to generate the same character repeatedly!" They were already wrong, because Textual Inversion and DreamBooth had already been published. Soon enough both were ported to SD, and people could dump out thousands of images of the same character in the same consistent style, etc. (and did).

The issue is more that I can’t get the equivalent of a slider control to adjust one or more properties of the voice from the AI in real time. Take a vocal fry slider, to use an example of something most people can do deliberately when they want to. The currently available models are pre-trained to sound like the average/median of one specific person (or character). I imagine tools will improve to control and customise the training of these models, but I don’t see a clear path from the current model architecture to one where I can freely control the stylistic expression of the vocal output without loading a completely different set of model data trained for that new desired output.
No, that's easy. We had the equivalent of that in GANs many years ago. If you've never seen GAN editing, here's a quick video: https://www.youtube.com/watch?v=Z1-3JKDh0nI (Background: https://gwern.net/face#reversing-stylegan-to-control-modify-... ) You just classify the latents and then you can edit them. These days, with pretrained models like CLIP, you don't necessarily even need a latent space: you can take a model which has been trained on sound/text descriptions, like AudioCLIP, prompt it with a text like "vocal fry", and then the generated samples are subtly skewed to try to maximize similarity with "vocal fry". You put a slider on how much weight that skewing gets, and now you have a slider control to adjust properties of the voice from the AI. If something like this doesn't exist, it's obvious how to do it. (Even the realtime problem is being solved by figuring out how to train diffusion models to do a GAN-like single pass: https://arxiv.org/abs/2309.06380 )
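For anyone who hasn't seen latent editing, here is a minimal sketch of the "classify the latents, then edit" idea. It is an illustration under stated assumptions, not how any particular voice model works: the latents are synthetic random vectors, the "vocal fry" labels are made up, and the classifier is swapped for the simplest possible stand-in (the difference between the two class means). The function names `attribute_direction`, `apply_slider`, and `attribute_score` are hypothetical.

```python
import numpy as np

def attribute_direction(latents, labels):
    """Unit vector from the mean latent without the attribute to the mean with it.

    This difference-of-means direction is a crude linear classifier; a real
    system would more likely fit logistic regression or an SVM on the latents.
    """
    latents = np.asarray(latents, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    direction = latents[labels].mean(axis=0) - latents[~labels].mean(axis=0)
    return direction / np.linalg.norm(direction)

def apply_slider(latent, direction, amount):
    """Shift one latent along the attribute direction; `amount` is the slider value."""
    return np.asarray(latent, dtype=float) + amount * direction

def attribute_score(latent, direction):
    """Projection of a latent onto the attribute direction."""
    return float(np.dot(latent, direction))

# Synthetic stand-in data (assumption): 16-dim latents where dimension 3
# happens to encode the attribute we care about, e.g. "vocal fry".
rng = np.random.default_rng(0)
dim = 16
true_axis = np.zeros(dim)
true_axis[3] = 1.0
labels = rng.random(200) < 0.5
offsets = np.where(labels[:, None], 2.0, -2.0) * true_axis
latents = rng.normal(size=(200, dim)) + offsets

direction = attribute_direction(latents, labels)

# Take one latent and move the slider; a real pipeline would now feed each
# edited latent back through the generator to synthesize audio.
z = rng.normal(size=dim)
for amount in (-2.0, 0.0, 2.0):
    edited = apply_slider(z, direction, amount)
    print(amount, round(attribute_score(edited, direction), 3))
```

Because the direction is normalized, moving the slider by `amount` changes the attribute score by exactly `amount`, which is what makes it behave like a slider. The text-prompted variant described above replaces the labeled direction with a similarity score from a text/audio model and weights that score during generation instead.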
I didn’t get to really explore the GAN generation of ML work particularly well since I had no supported hardware (no desire to support the nVidia monopoly on ML work) and refused to blow money on cloud instances I’d probably forget at some point and wind up with a giant bill.

It’s a really different world now that I’ve got massive models running on my laptop thanks to Apple Silicon and the unified memory architecture, and the C++ ports of various diffusion image models and several families of large language models work well on my AMD GPU too… it’s so much easier to participate in the current generation of applied ML work without having to go out of my way to buy specifically ML-supported hardware.

I have said this will initially be sold as a feature on things like audiobooks.

Pick your book, pick your reader and away it goes. The Diary of Anne Frank read by Gilbert Gottfried.

Not sure if your hypothetical was meant to be a reference to the absolutely hilarious classic “Gilbert Gottfried reads Fifty Shades of Grey”, but it has me wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines.

https://youtu.be/XkLqAlIETkA (Extremely NSFW without headphones)

> wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines

For me it came from the voice; I hadn't heard of Gilbert Gottfried as a specific person until I read this discussion. The reaction faces of the women listeners were also amusing.

I still like getting surprised when a new or unorthodox narrator knocks it out of the park, but I’d really enjoy a “salvage this purchase” exit hatch with an AI voice alternative. I’d even pay a buck or two on top of an existing purchase to automatically fix a bad narration.

Head over to the Audible reviews: some books are widely considered great as written, but the audiobook is reviewed as one to avoid because it was recorded poorly, the narrator paced it wrong, they had an annoying voice, they couldn’t do a voice of the opposite gender, whatever.

Plus it seems like a great accessibility feature. Many books are recorded for the vision-impaired community by volunteers, and that’s admirable, but some of today’s AI voices do a much better job.

These are some very fair points. There was one book 'Electron Fire' all about the creation of the transistor, I think. I say that because never have I heard a more unenthused narrator. Makes Henry Kissinger sound like a dramatic actor.

Any AI voice could save that one. Any of them! Heck, the original voice on the 1984 Macintosh could do better.

Recent voice models by OpenAI, Meta, and ElevenLabs all state upfront that they work with paid professional voice actors, so this space will get interesting fast.
Mozilla has a voice data project where people already do it for free(dom) ;)

https://commonvoice.mozilla.org/en