HN Mirror



	by drhodes 692 days ago
	just an idea, but what if the appended audio clip was reversed to ensure continuity in the waveform? That is, if >< is the splice point and CLIP is the audio clip, then the idea would be to construct CLIP><PILC.
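A minimal sketch of the CLIP><PILC idea above, in plain Python. The function names, the 440 Hz test tone, and the 16 kHz sample rate are illustrative assumptions, not anything from the thread. It only shows why a reversed copy avoids a jump at the splice point: the sample just before the splice is repeated just after it.

```python
import math

def mirror_extend(clip):
    """Return the clip followed by its time-reversed copy (CLIP><PILC)."""
    return clip + clip[::-1]

def splice_jump(signal, splice_index):
    """Absolute difference between the samples on either side of the splice."""
    return abs(signal[splice_index] - signal[splice_index - 1])

# Hypothetical input: a 440 Hz tone at 16 kHz that is cut off mid-cycle.
sample_rate = 16_000
clip = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1000)]

mirrored = mirror_extend(clip)
zero_padded = clip + [0.0] * len(clip)

# The mirrored version repeats the last sample, so there is no jump.
print(splice_jump(mirrored, len(clip)))
# Zero padding jumps from the clip's last sample straight to 0.0.
print(splice_jump(zero_padded, len(clip)))
```

Note that the waveform is continuous at the splice, but its slope flips sign there, so this removes the click rather than every audible artifact.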


andrew-w 692 days ago

This is exactly what we do today! It seems to work better the more you extend it, but extending it too much introduces other side effects (e.g. the avatar will start to open its mouth, as if it were preparing to talk).

drhodes 692 days ago

Hmm, maybe adding white noise would work. -- OK, that's quite enough unsolicited suggestions from me up in the peanut gallery. Nice job on the website, it's impressive, thank you for not requiring a sign up.

andrew-w 692 days ago

All for suggestions! We've tried white noise as well, but it only works on plain talking samples (not music, for example). My guess is that the most robust solution will come from updating how it's trained.

bobbylarrybobby 691 days ago

What if you train it to hold the last frame on silence (or quiet noise)?

andrew-w 691 days ago

We've talked about doing something like that. Feels like it should work in theory.

jazzyjackson 691 days ago

Or noise corresponding with a closed mouth

Hmmmmmmmm

Ohmmmmmmm

swyx 690 days ago

hmm weird, i thought you criticise heygen for doing exactly that (mirroring the input)

sidneyprimas 690 days ago

HeyGen (and our V1 model) literally uses the user on-boarding video in the final output. See here for a demonstration of this (https://toinfinityai.github.io/v2-launch-page/#comparisons). We are not talking about that in this thread. We are trying to solve a quirk of our Diffusion Transformer model (V2 model).

Our V2 model is trained on specific durations of audio (2s, 5s, 10s, etc.) as input. So, if we give the model a 7s audio clip during inference, it will generate lower quality videos than at 5s or 10s. Instead, we buffer the audio up to the nearest training bucket (10s in this case). We have tried buffering it with a zero array, with white noise, and by just concatenating the input audio (inverted) to the end. The drawback is that the last frame (the one at 7s) is more likely to fail. We need to solve this.
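For readers who want to see the buffering step concretely, here is a rough sketch in plain Python. It is not the actual pipeline: the bucket list only contains the three durations named above, rounding up to the next bucket is inferred from the 7s to 10s example, and `nearest_bucket`, `pad_to_bucket`, the noise amplitude, and the "reverse" mode (a time-reversed copy, which may or may not be what "inverted" means here) are all hypothetical.

```python
import random

# Only the durations named in the comment; the real list is longer ("etc.").
BUCKETS_S = (2, 5, 10)

def nearest_bucket(duration_s, buckets=BUCKETS_S):
    """Smallest bucket that fits the clip (assumed: always round up)."""
    for bucket in sorted(buckets):
        if duration_s <= bucket:
            return bucket
    raise ValueError(f"clip of {duration_s}s is longer than every bucket")

def pad_to_bucket(samples, sample_rate, mode="zeros", seed=0):
    """Pad samples up to the bucket length.

    Returns (padded_samples, original_length) so a caller can tell where
    the real audio ends (7s in the example above).
    """
    if not samples:
        raise ValueError("empty clip")
    target = nearest_bucket(len(samples) / sample_rate) * sample_rate
    missing = target - len(samples)

    if mode == "zeros":
        pad = [0.0] * missing
    elif mode == "noise":
        # Low-amplitude uniform white noise; the 0.01 amplitude is a guess.
        rng = random.Random(seed)
        pad = [rng.uniform(-0.01, 0.01) for _ in range(missing)]
    elif mode == "reverse":
        # Reversed copy of the input; bounce back and forth if the gap is
        # longer than the clip itself.
        pad, chunk = [], samples[::-1]
        while len(pad) < missing:
            pad.extend(chunk)
            chunk = chunk[::-1]
        pad = pad[:missing]
    else:
        raise ValueError(f"unknown mode: {mode}")

    return samples + pad, len(samples)

# A fake 7s clip at a toy sample rate of 100 Hz.
clip = [float(i) for i in range(700)]
padded, real_length = pad_to_bucket(clip, sample_rate=100, mode="reverse")
print(len(padded), real_length)
```

None of the three modes addresses the failure at the last real frame described above; they only get the input to a length the model was trained on.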

And, no shade on HeyGen. It's literally what we did before. And their videos look hyper realistic, which is great for B2B content. The drawback is you are always constrained to the hand motions and environment of the on-boarding video, which is more limiting for entertainment content.

swyx 690 days ago

i already love you guys more than them bc of how transparent you are. keep it up!!
