HeyGen (and our V1 model) literally uses the user on-boarding video in the final output. See here for a demonstration of this (https://toinfinityai.github.io/v2-launch-page/#comparisons). We are not talking about that in this thread. We are trying to solve a quirk of our Diffusion Transformer model (V2 model).

Our V2 model is trained on specific durations of audio (2s, 5s, 10s, etc.) as input. So if you give the model a 7s audio clip during inference, it will generate lower-quality video than at 5s or 10s. Instead, we pad the audio up to the next training bucket (10s in this case). We have tried padding with a zero array, with white noise, and with the input audio (inverted) concatenated to the end. The drawback is that the last frame (the one at 7s) is more likely to fail. We need to solve this.
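
To make the padding step concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the production code: the bucket list only includes the three durations named above, the function names, the `mode` strings, and the noise level are hypothetical, and "inverted" is read here as time-reversed (it could also mean polarity-inverted).

```python
import numpy as np

# Assumed bucket durations in seconds; the post only says "2s, 5s, 10s, etc".
BUCKETS_S = (2.0, 5.0, 10.0)


def next_bucket(duration_s, buckets=BUCKETS_S):
    """Return the smallest bucket that is at least as long as the clip."""
    for bucket in sorted(buckets):
        if duration_s <= bucket:
            return bucket
    raise ValueError("clip is longer than the largest bucket")


def pad_to_bucket(audio, sample_rate, mode="zeros", seed=0):
    """Pad a mono waveform up to the next bucket length.

    Returns (padded_audio, n_original_samples) so the caller knows where
    the real audio ends and can trim the generated video accordingly.
    """
    audio = np.asarray(audio, dtype=np.float32)
    n = audio.shape[0]
    if n == 0:
        raise ValueError("empty audio clip")

    target = int(round(next_bucket(n / sample_rate) * sample_rate))
    missing = target - n

    if mode == "zeros":
        tail = np.zeros(missing, dtype=np.float32)
    elif mode == "noise":
        # Low-level Gaussian white noise; the 0.01 scale is an arbitrary choice.
        rng = np.random.default_rng(seed)
        tail = (0.01 * rng.standard_normal(missing)).astype(np.float32)
    elif mode == "reversed":
        # Time-reversed copy of the input, repeated if the gap is longer
        # than the clip itself.
        tail = np.resize(audio[::-1], missing)
    else:
        raise ValueError(f"unknown mode: {mode}")

    return np.concatenate([audio, tail]), n
```

With a 7s clip at an assumed 16 kHz sample rate, `pad_to_bucket(clip, 16000)` returns a 10s waveform plus the original sample count (112000), which marks the point where the output video would be cut.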

And no shade on HeyGen. It's literally what we did before. And their videos look hyper-realistic, which is great for B2B content. The drawback is that you are always constrained to the hand motions and environment of the on-boarding video, which is more limiting for entertainment content.