
by synapticpaint 1213 days ago

Here is a preliminary test for video editing using ControlNet I made: https://www.youtube.com/watch?v=u52MOA4YaGk

As you can see, there is still quite a bit of flicker, which I'm working to reduce. But the consistency is much better than with, say, img2img.

I'm hoping to ship a prototype this week.

4 comments

guiambros 1213 days ago

Haven't read the paper yet, but curious how different ControlNet is from Text2LIVE ([1], [2]). Seems it's solving the same problem with temporal consistency, no?

[1] https://www.youtube.com/watch?v=8U9o5aZ2y5w

[2] https://text2live.github.io/


synapticpaint 1212 days ago

No, ControlNet wasn't made to solve temporal consistency; it was made to add more control (hence the name) to image models. I am using it in a way that the authors may not have thought of, since the paper doesn't mention video editing.


meghan_rain 1213 days ago

S-so the girl on the right in the second half of the video is not real...?


synapticpaint 1213 days ago

Correct, it's generated by SD.


refulgentis 1213 days ago

Really curious about this too, OP: is the face generated? How do you keep temporal consistency with that?


synapticpaint 1213 days ago

So, the video was generated by applying ControlNet to the input video frame by frame. Every inference setting is the same for every frame -- seed, prompt, CFG, steps, and sampler. The only thing that changes frame to frame is that the pose changes slightly. So actually, if SD was well behaved, you would expect the difference between adjacent frames to be small, because the change in the input is small. But SD is somewhat schizophrenic so you get this amount of flicker from even small changes in input.

I also had to specify what the outfit should be (when I didn't, the outfit changed from frame to frame and I got a lot more discrepancies). You can see that the outfit changes color in the second version; I bet you could make that even more consistent by specifying the color in the prompt too.

If you create a dreambooth model of a character, you can probably also get consistency of the face that way. In this case I didn't need to do this because I didn't care who I got, I just asked for an "average woman".
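The frame-by-frame setup described above can be sketched roughly as follows. This is only an illustration of the idea that every setting is held fixed and only the pose input varies; the `FrameJob` container, the `build_jobs` helper, the file names, and all the concrete values (seed, CFG, steps, sampler name, outfit in the prompt) are hypothetical placeholders, not the settings actually used for the video.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class FrameJob:
    """Hypothetical container for one per-frame inference request."""
    prompt: str
    seed: int
    cfg_scale: float
    steps: int
    sampler: str
    pose_image: str  # path to the pose map extracted from this input frame


def build_jobs(pose_images: List[str], base: FrameJob) -> List[FrameJob]:
    # Copy the base settings for every frame; only the pose input differs.
    return [replace(base, pose_image=path) for path in pose_images]


# Placeholder values, chosen only for illustration.
base = FrameJob(
    prompt="an average woman wearing a red dress",
    seed=1234,
    cfg_scale=7.5,
    steps=20,
    sampler="euler_a",
    pose_image="",
)
jobs = build_jobs(["pose_0001.png", "pose_0002.png", "pose_0003.png"], base)
```

Each job would then be handed to whatever ControlNet pipeline you use; that call is left out here because it depends on the tooling.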


meghan_rain 1213 days ago

Is there like a "temperature" setting you can change? And set it to 0 to produce less flickering?


Lerc 1213 days ago

The flickering comes from the fundamental nature of the de-noising mechanism involved in the diffusion model. The ability to create multiple novel images for the same input comes from adding noise with a random seed. Currently this is more or less done every frame which is why you get the flickering. Keeping the same seed wouldn't be helpful if you want the image to move.

What could be of use here is a noise transformation layer that can use the same noise for every frame but transformed to match desired motion. For video conversion you could possibly extract motion vectors from successive frames to warp the noise.

I assume someone is working on this somewhere.
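The noise-warping idea above could be sketched very roughly like this. It is a toy, pure-Python illustration under heavy assumptions: the noise is a small 2D grid rather than a real latent tensor, motion vectors are whole-pixel `(dx, dy)` offsets that are assumed to be already available, and borders wrap around. The names `make_noise` and `warp_noise` are made up for this example.

```python
import random
from typing import List, Tuple

Grid = List[List[float]]
Flow = List[List[Tuple[int, int]]]  # per-pixel (dx, dy) motion in whole pixels


def make_noise(height: int, width: int, seed: int) -> Grid:
    # Seeded Gaussian noise, so the same seed always gives the same grid.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]


def warp_noise(noise: Grid, flow: Flow) -> Grid:
    height, width = len(noise), len(noise[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            # Backward warp: fetch the noise value from where this pixel
            # came from in the previous frame, wrapping at the borders.
            row.append(noise[(y - dy) % height][(x - dx) % width])
        warped.append(row)
    return warped


base_noise = make_noise(4, 4, seed=1234)
shift_right = [[(1, 0)] * 4 for _ in range(4)]  # everything moves 1 px right
next_noise = warp_noise(base_noise, shift_right)
```

With a uniform shift the warped grid is just a rearrangement of the same values. A non-uniform flow field would duplicate some noise samples and drop others, so a real implementation would probably need to deal with that.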


synapticpaint 1212 days ago

"The flickering comes from the fundamental nature of the de-noising mechanism involved in the diffusion model." -- agreed

"Keeping the same seed wouldn't be helpful if you want the image to move." -- No, I'm using the same seed (and prompt). The image moves because ControlNet opens up another channel of input, in this case the pose data.


shubb 1213 days ago

I wonder if putting an adversarial network on top would reduce the flickering: a mechanism that only accepts a frame if it is detected to be the next frame in a video of the same person, and otherwise regenerates it.
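As a rough sketch of that accept-or-regenerate loop: the function below is hypothetical, and both the generator and the "is this a plausible next frame" check are stand-in callables, since no real discriminator is specified here. It also assumes that regenerating means trying a different seed, which departs from a fixed-seed setup.

```python
from typing import Callable, Optional, TypeVar

Frame = TypeVar("Frame")


def generate_consistent_frame(
    generate: Callable[[int], Frame],
    is_plausible_next: Callable[[Frame, Frame], bool],
    previous: Optional[Frame],
    first_seed: int,
    max_attempts: int = 5,
) -> Frame:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    candidate = generate(first_seed)
    for attempt in range(1, max_attempts + 1):
        # The first frame has nothing to be compared against.
        if previous is None or is_plausible_next(previous, candidate):
            return candidate
        if attempt < max_attempts:
            # Rejected: regenerate with the next seed.
            candidate = generate(first_seed + attempt)
    # Every attempt was rejected; fall back to the last candidate.
    return candidate


# Stand-ins for illustration: a "frame" is a single number, and the check
# accepts a candidate only if it is close to the previous frame.
def fake_generate(seed: int) -> float:
    return seed * 0.5


def fake_check(prev: float, cand: float) -> bool:
    return abs(cand - prev) < 0.3


chosen = generate_consistent_frame(fake_generate, fake_check, previous=1.0, first_seed=0)
fallback = generate_consistent_frame(
    fake_generate, fake_check, previous=1.0, first_seed=0, max_attempts=2
)
```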


synapticpaint 1212 days ago

Not really, but I think there are other things you can do to reduce flickering that I'm looking into.


refulgentis 1213 days ago

That’s really really really good. I have an overtrained Dreambooth model I was using with ControlNet, and even mine was flickering in the face more than this.


synapticpaint 1212 days ago

Are you using canny mode? One of the other modes (HED, segmentation, or depth) may give you more consistency. Lmk how it goes if you try this.


thom 1213 days ago

Impressive that it doesn’t just do pose transfer but applies the correct inverse kinematics too (hand on the wall/rail etc).


synapticpaint 1212 days ago

Yes, it's asking SD for an image with some set of characteristics, and SD has some notion of what is plausible from what it saw during training.


jpeter 1213 days ago

Can you mask the background out?
