Hacker News
by danShumway 1996 days ago
Eh, this technology currently falls very squarely into the category of "almost good enough that I could use it for a creative project, but not nearly good enough that you're going to be able to convince me that the results aren't generated."

I'm not primarily interested in the dollars, I'm interested in allowing communities to do creative things. I think people are looking at this tech like it's only going to be used for deepfakes, and they're underestimating the extent it's going to be used to create voice-acted game mods, animations, anonymization tools, and other creative/helpful projects.

If you're really worried about this stuff though, you can take some comfort in the fact that by far the worst examples on the site are of real-world voices. This is currently technology that as far as I can see is far more suited for generating new voices or voicing cartoon characters with well-defined patterns/inflections than it is for imitating the president.

It really doesn't have to be perfect to trick someone. You're expecting this site to be fake, so you're listening carefully. If you weren't expecting anything and you were in the middle of a busy day at work, you would be much, much less likely to notice any discrepancies.

We already have stories like https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice...

That said, as far as harms go, I don't think this is so bad that it should preclude creative uses of this technology.

You are looking at the current implementation and not thinking about the implication.

One, this tech absolutely could be used to fool someone. Not everyone will be listening with a critical ear. Playing it back over a phone, or injecting a phrase or two into otherwise spoken samples, will fool many people.

I guarantee you someone will be using this to make their own MLP episodes on YouTube specifically designed to scare children or get them to do awful things.

Models presumably get better over time. It really won't be too much longer until people will be able to fake celebrities, politicians, exes, authority figures, etc. As a fairly benign example, if I had this in high school you better believe I could have called to excuse some of my absences.

I agree, I love the idea of generating some decent voice lines for my own game projects, but this also introduces issues of the rights of the original voice actors.

If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic? (Also, it draws parallels to the Luddites, who were not anti-technology, but wanted to ensure that technology wasn't used in a way that reduced worker quality of life.)

And yes, I think there are helpful ways this could be deployed. I'm gender fluid, and I'd love to be able to adjust my voice digitally, but we need to be thinking about how this could cause harm first.

> One, this tech absolutely could be used to fool someone.

The problem I have here is that it's already not hard to fool people. I don't think it's feasible for us to say that we're going to put something that could be highly beneficial on hold just because we don't want to deal with social education efforts that we kind of already need to tackle anyway. Per your example, if we get rid of deepfakes, it's not clear to me that YouTube is going to be any more safe. I already would not allow a child to browse YouTube unattended; people already generate the videos you're talking about.

And I know that people are putting this in a different category than general CGI, voice modulation, or consumer-grade apps like Photoshop. I'm not going to argue that it's necessarily wrong for people to be worried, but no matter how many times people tell me that this is fundamentally different, I still have not seen any serious evidence that this technology is going to be more dangerous than Photoshop, and I think it's going to be way easier to detect than a decent Photoshop job is. Photoshop's content-aware paste/fill tools are better than this example, and they arguably require less work to use.

And again... I'm sympathetic to concerns about moving too fast, but I just don't think there's any world, even if you could get rid of deepfakes entirely, where we don't need to be worried about media literacy and general skepticism. If people today don't realize that voices can already be convincingly faked, then that's a really serious problem, and if democratizing that ability causes society in general to become more aware of the potential of disinformation, then honestly that might even be a good thing that we should be encouraging.

So sure, concerns, but in my mind people are focusing on one particular implication that I don't think is particularly likely, and ignoring that responding to that concern is probably going to look the same no matter what our position on deepfakes is.

> If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic?

I think that's a very complicated question. I would not assume that the loss of work for voice actors, who can shift into voice generation roles, is going to be a big enough downside that it overrules the upside of allowing ordinary people to start generating their own vtube avatars or commenting on and building on top of existing culture.

> If people today don't realize that voices can already be convincingly faked, then that's a really serious problem, and if democratizing that ability causes society in general to become more aware of the potential of disinformation, then honestly that might even be a good thing that we should be encouraging.

I've wondered about that angle as well. You can't put the genie back in the bottle, so maybe the best way to combat the threat of deepfaked misinformation is actually to take the opposite approach and make it as easy as possible for normal people to generate their own deepfakes; that way it becomes common knowledge that such things are possible (similar to how Photoshop is common knowledge today).

> If you train a model to mimic a performance given by an actor, then use that model and fire the actor, isn't that potentially really problematic?

And if you have to keep paying a person for something that a machine could do with (assuming, as per your post) 100% equal performance, is that not problematic? When the generated voice becomes as good as real actors, then yes, of course they should be out of a job. That's just how progress has gone for thousands of years.

I am thinking it could be used to impersonate someone in a phone call to a family member in order to con them.

I might be misunderstanding you, but there are no real-world voices on the site? All of them are of characters.
I see a pretty linear drop in quality from GLaDOS to Spongebob to Twilight Sparkle to the narrator from The Stanley Parable to the 10th Doctor.

It seems to struggle more and more as the voices get less cartoony/exaggerated.

I'm not too sure about that. From my testing, Fluttershy, Applejack, Twilight, Chrysalis, Rise, and Kyu (and a bunch of other characters that I'm surely forgetting) seem to perform phenomenally well. Especially Chrysalis, her emotions are extremely believable, and Fluttershy/Applejack/Rise/Kyu have almost zero noise for every generation. This might be the most impressive site I've ever seen.

Oh, I somehow forgot all of the TF2 characters. Some of them do struggle (Medic the most, I think) but everyone else seems incredibly good.

And the Daria characters, too. Honestly, the vast majority of characters are already near-perfect.

Hrm. Well, I can't really argue with that beyond that my standards on perfect might be different.

I think some of the best voices they have are characters like Twilight, she shows a ton of promise. But as it stands right now, I would still at least hesitate to use Twilight's voice in a project unless I didn't have other options. Chrysalis's voice is good, but again, is an exaggerated cartoon character with a large amount of inflection. I would not use her voice in her current state without a lot of post-processing. Someone like the Spy I would consider to be unusable, it sounds to me like the character needs to clear their throat or something, it's got a lot of strange artifacts. I definitely would consider the 10th Doctor unusable, even for just a hobby project or a voice assistant.

But... I don't know, maybe this is subjective. I can't just tell you that what you're hearing is wrong, if you like the results then you like the results :)

And again, I don't want to detract from how impressive they are. They are incredibly impressive, particularly because of how characters like Chrysalis emote. Extremely promising. But I still think there's a difference between 'impressive' and 'believable deepfake'.

Yeah, that's fair. I dunno, I can't really hear anything wrong with Fluttershy or Applejack no matter how hard I try, but your ears are probably much better than mine :p

I've been seeing quite a few skits being posted on /r/tf2 (https://www.reddit.com/r/tf2/comments/kr374q/honestly_idk_i_...) and all of the voices sound pretty much perfect to me. But as you said, it's subjective.