| HN Mirror



	by shtack 627 days ago
	Cool, I built a prototype of something very similar (face + voice cloning, no video analysis) using openly available models/APIs: https://bslsk0.appspot.com/ The video latency is definitely the biggest hurdle. With dedicated A100s I can get it down to under 2s, but it's pricey.


leobg 627 days ago

This looks awesome. Didn’t seem to hear me, but the video looks great. Can you share what models you are using? You say these are all open models.


shtack 626 days ago

The model doing the heavy lifting is https://github.com/Rudrabha/Wav2Lip

Mic permissions on mobile are tricky, which might have been your issue? Note in this prototype you also need to hold the blue button down to speak.


leobg 626 days ago

Interesting. I didn’t think you could get anything close to realtime with Wav2Lip.


shtack 626 days ago

With a dedicated GPU and some cleverness it can be relatively quick. I split the response on punctuation and generate smaller clips in a pipeline. I haven't taken the model apart to try streaming the frames coming out of ffmpeg yet, but that would probably help a lot.
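
A rough sketch of the split-and-pipeline idea, not the actual prototype code. The functions `synthesize_speech` and `render_lipsync` are hypothetical stand-ins for a TTS call and a Wav2Lip inference call; neither is a real API. The point is only that speech for the next chunk can be generated while the current chunk is being rendered, so the first clip is ready sooner.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_on_punctuation(text):
    """Split text into short chunks that each end at a punctuation mark."""
    parts = re.split(r"(?<=[.!?,;:])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_speech(chunk):
    # Hypothetical placeholder: a real version would call a TTS model here.
    return f"audio({chunk})"


def render_lipsync(audio):
    # Hypothetical placeholder: a real version would run Wav2Lip inference here.
    return f"clip[{audio}]"


def pipeline(text):
    """Yield clips in order, overlapping TTS for later chunks with rendering."""
    chunks = split_on_punctuation(text)
    # One background worker handles TTS in submission order, so the audio
    # for chunk N+1 is produced while chunk N is being rendered below.
    with ThreadPoolExecutor(max_workers=1) as tts_pool:
        audio_futures = [tts_pool.submit(synthesize_speech, c) for c in chunks]
        for future in audio_futures:
            yield render_lipsync(future.result())


if __name__ == "__main__":
    for clip in pipeline("Hello there. How are you, friend? Fine!"):
        print(clip)
```

Each yielded clip could be sent to the client as soon as it is ready, rather than waiting for the whole response to render.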
