by empath-nirvana
793 days ago
The results are actually shockingly bad, considering that I think this should be _easier_ than producing a realistic image from scratch, which AI does quite well. I have only a fuzzy idea of how to implement this, but it seems to me that keyframes _should_ be interchangeable with in-between frames. So you'd want to train the model so that if you start with keyframes, generate in-between frames, and then run those in-between frames back through the model, it regenerates the keyframes.
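A minimal sketch of that round-trip idea, not a real training setup. It assumes a frame is just a list of floats and that the model is any function taking two frames and returning the frame halfway between them; `lerp_midpoint` and `cycle_consistency_loss` are hypothetical names made up for illustration. The loss measures how far the regenerated keyframes land from the originals.

```python
from typing import Callable, List, Sequence

Frame = List[float]
# Hypothetical interface: a model maps two frames to the frame between them.
Model = Callable[[Frame, Frame], Frame]


def lerp_midpoint(a: Frame, b: Frame) -> Frame:
    """Toy stand-in for a learned in-betweener: plain linear interpolation."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def cycle_consistency_loss(keyframes: Sequence[Frame], model: Model) -> float:
    """Mean squared error between interior keyframes and their regenerations.

    Step 1: generate one in-between frame per adjacent keyframe pair.
    Step 2: treat those in-betweens as keyframes and in-between them again.
    Step 3: compare each result with the original keyframe it should match.
    """
    if len(keyframes) < 3:
        raise ValueError("need at least three keyframes to close the cycle")

    inbetweens = [model(a, b) for a, b in zip(keyframes, keyframes[1:])]

    total = 0.0
    count = 0
    for i in range(len(inbetweens) - 1):
        regenerated = model(inbetweens[i], inbetweens[i + 1])
        target = keyframes[i + 1]  # the keyframe sitting between the two in-betweens
        total += sum((r - t) ** 2 for r, t in zip(regenerated, target))
        count += len(target)
    return total / count


linear_motion = [[0.0], [1.0], [2.0]]
curved_motion = [[0.0], [1.0], [4.0]]
print(cycle_consistency_loss(linear_motion, lerp_midpoint))
print(cycle_consistency_loss(curved_motion, lerp_midpoint))
```

With the toy interpolator the loss is zero only when the motion is itself linear, so for a learned model the same quantity could serve as a training signal that penalizes in-betweens which don't round-trip.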
Think of this in terms of constraints. An image from scratch has self-consistency constraints (this part of the image has to be consistent with that part) and may have semantic constraints (if it has to match a prompt). An animation frame has the same self-consistency constraints, but it also has to be consistent with other entire images! The fact that the images are close in some semantic space helps, but all the tiny details now have to be precisely correct in a way they didn't before.
Like, if a model has some weird gap where it knows how to make an arm at 45 degrees and 60 degrees, but not 47, that's fine for from-scratch generation. It'll just make one the way it knows how (or more precisely, one it models as naturally likely). Same with any other weird quirks in what it thinks is good (naturally likely): it can just adjust to something that still matches the semantics but fits the model's quirks. No such luck when you need to get details like "47 degrees" correct. It's just a little harder without some training or modeling insight into how an arm at 45 degrees and one at 47 degrees are really "basically the same" (or just that much more data, so that you lose the weird bumps in the likelihood).
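Here is a toy illustration of the 45/60/47 point, under heavy assumptions: the "model" is reduced to a list of arm angles it can render well, and anything else gets snapped to the nearest one. `nearest_known` and `inbetween_errors` are hypothetical names, and real models have smooth likelihood bumps rather than a hard lookup table.

```python
from typing import List, Sequence


def nearest_known(angle: float, known: Sequence[float]) -> float:
    """Snap a requested angle to the closest angle the toy model can render."""
    return min(known, key=lambda k: abs(k - angle))


def inbetween_errors(start: float, end: float, steps: int,
                     known: Sequence[float]) -> List[float]:
    """Angle error (degrees) at each in-between frame of a linear sweep.

    From-scratch generation would accept any known angle, so its error is
    zero by construction. In-betweening has an exact target for every frame.
    """
    errors = []
    for i in range(1, steps):
        target = start + (end - start) * i / steps
        errors.append(abs(nearest_known(target, known) - target))
    return errors


known_angles = [45.0, 60.0]  # assumption: the only angles the model does well
errors = inbetween_errors(45.0, 60.0, 15, known_angles)
print(errors[1])    # error at the 47-degree frame
print(max(errors))  # worst frame in the sweep
```

The errors are the "weird bumps": harmless when any likely-looking arm will do, visible as jitter or snapping once every frame has a specific angle it has to hit.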
I wouldn't be surprised if "just that much more data" ends up being the answer, given the volume of video data on the internet, and the wide applicability of video generation (and hence intense research in the area).