by DharmaPolice 992 days ago
I must admit I found that element of the story...surprising. Has realtime faking gotten that good yet? Presumably there was back and forth in this call so this person was either disguising their voice or typing responses to be generated on the fly.

I know all this can be done, I'm just surprised it's reached the maturity where an attacker would choose to impersonate someone the call recipient presumably knew vs just being a vague "Bob from IT".

Although to be fair the article does say the employee was suspicious so maybe there was a delay which (if you were looking for it) you would spot.

You could probably reduce the "delay" by using a soundboard of pre-generated filler material and playing that while you type the real response. "Let me find that bookmark", "So the thing about that is...", "ummm yeah. so...", "hmmm no not really"

You can also use text macros to type the response faster. Here they were trying to get MFA access, so you could map longer phrases that will come up often like "Okta multi factor authentication" to numpad 1. Company name to numpad 2. IT supervisor name to numpad 3.
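For what it's worth, the macro idea is just a lookup table. A minimal sketch, assuming made-up key names (`numpad1` and so on) and placeholder phrases; none of this comes from the article or any particular macro tool:

```python
# Hypothetical key-to-phrase table. The key names and the bracketed
# placeholders are illustrative assumptions, not a real tool's config.
MACROS = {
    "numpad1": "Okta multi factor authentication",
    "numpad2": "<company name>",
    "numpad3": "<IT supervisor name>",
}


def expand(tokens):
    """Replace macro keys with their phrases; pass other tokens through."""
    return " ".join(MACROS.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(expand(["reset", "numpad1", "for", "numpad2"]))
```

Anything not in the table is passed through as typed, so the macros only save keystrokes on the long phrases that keep coming up.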

If you know the target of the conversation you can tailor what you pre-generate. I like to mess with scam callers when I get one, and I've noticed some are using some kind of soundboard with a woman's voice (I'm pretty positive it is real and not AI) and they have a planned flow / script. If you try to deviate from the script they have some options to bring you back into it. If you ask them to repeat something you can notice it's the exact same audio snippet as before. If you accuse them of being a bot they have a few samples of the woman being shocked and mildly embarrassed. "Oh my goodness, do I really sound like a bot? No it's just been a long work day for me. I'm sorry about that."

Why type or use a soundboard? You aren’t thinking Mission Impossible enough.

Live transcription in real time has been a thing forever, so there’s no reason to think this couldn’t all be glued together into a “voice changer” like the typical super-deep “I have your son, give me a million dollars” boxes, except that instead of doing frequency modulation it pipes the audio to a model trained on someone’s voice and applies it. Transcribing to text probably isn’t even needed, because why would it be for machine-to-machine modification? It only needs to go to text for human consumption.

Raw PCM bits from audio in -> AI model trained on victim's voice -> line out to phone or VoIP app.

We totally have the compute to do that. Probably with our phones.
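On the delay question, the shape of that pipeline puts a floor on latency: you can't emit a frame until you've collected it. A rough sketch of the frame-by-frame plumbing, assuming a placeholder `convert` step (here a plain passthrough, not a real model) and a frame size and sample rate I picked only for illustration:

```python
from typing import Callable, Iterator, List

Frame = List[int]  # one block of PCM samples


def frames(samples: List[int], frame_size: int) -> Iterator[Frame]:
    """Split a sample buffer into fixed-size frames (last one may be short)."""
    for start in range(0, len(samples), frame_size):
        yield samples[start:start + frame_size]


def stream(samples: List[int], frame_size: int,
           convert: Callable[[Frame], Frame]) -> Iterator[Frame]:
    """Audio in -> per-frame transform -> audio out."""
    for frame in frames(samples, frame_size):
        yield convert(frame)


def buffering_delay_ms(frame_size: int, sample_rate_hz: int) -> float:
    """Minimum delay from buffering one frame, before any processing time."""
    return 1000.0 * frame_size / sample_rate_hz


def passthrough(frame: Frame) -> Frame:
    """Placeholder for the model step (an assumption); returns input unchanged."""
    return frame


if __name__ == "__main__":
    out = list(stream(list(range(10)), 4, passthrough))
    print(out)
    print(buffering_delay_ms(320, 16000), "ms per frame at 16 kHz")
```

Whatever the real transform costs gets added on top of that per-frame buffering, and that total is the delay a suspicious listener would be trying to notice.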

I can't remember which election it was, but the 3D animated character was pushing the limits of real-time rendering for its day when he appeared on a morning talk show and answered questions live. So the live thing has been around for quite some time. The deep fake just allows for the models to look believable. Once you have a model, you can make it do anything.

Faking a famous person would seem to me to be easier (for various reasons) than faking my colleague. It's not enough to fake the sound of their voice; you also have to fake the manner in which they speak: word choice, attitude, responses, knowledge, sense of humour etc. But I'm guessing the target of this attack only marginally knew the person being impersonated.

My point is that the approach seems unnecessarily risky vs just phoning up and pretending to be someone they didn't know.