by empath-nirvana 793 days ago
The results are actually shockingly bad, considering that I think this should be _easier_ than producing a realistic image from scratch, which AI does quite well.

I don't have more than a fuzzy idea of how to implement this, but it seems to me that key frames _should_ be interchangeable with in-between frames. So you want to train it so that if you start with key frames and generate in-between frames, and then run those in-between frames through the AI, it regenerates the key frames.
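
A minimal sketch of that round-trip idea, with plain midpoint interpolation standing in for the model. Everything here is an assumption for illustration: frames are reduced to lists of joint angles, and `lerp_inbetween` and `cycle_loss` are made-up names, not anything from the article.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: a frame is just a few joint angles in degrees.
Frame = List[float]


def lerp_inbetween(a: Frame, b: Frame) -> Frame:
    """Stand-in for the model: the midpoint of two frames."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def cycle_loss(keys: Sequence[Frame],
               inbetween: Callable[[Frame, Frame], Frame]) -> float:
    """Mean squared error between each interior key frame and the frame
    regenerated from the two generated in-betweens on either side of it."""
    # Step 1: key frames -> in-between frames.
    mids = [inbetween(a, b) for a, b in zip(keys, keys[1:])]
    total, count = 0.0, 0
    # Step 2: in-between frames -> regenerated key frames, compared to the originals.
    for key, (m0, m1) in zip(keys[1:-1], zip(mids, mids[1:])):
        regenerated = inbetween(m0, m1)
        total += sum((r - k) ** 2 for r, k in zip(regenerated, key))
        count += len(key)
    return total / count if count else 0.0


# Evenly spaced motion: the middle key frame is recovered exactly.
steady = cycle_loss([[0.0], [10.0], [20.0]], lerp_inbetween)
# Middle key frame is an extreme pose: the round trip misses it.
extreme = cycle_loss([[0.0], [90.0], [0.0]], lerp_inbetween)
```

With this linear stand-in the loss is zero only when the middle key frame already sits halfway between its neighbours, so a real model would have to learn something much richer than interpolation for the round trip to close.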

5 comments

> I think this should be _easier_ than producing a realistic image from scratch

Think of this in terms of constraints. An image from scratch has self-consistency constraints (this part of the image has to be consistent with that part) and it may have semantic constraints (if it has to match a prompt). An animation has the same self-consistency constraints, but each frame also has to be consistent with other entire images! The fact that the images are close in some semantic space helps, but all the tiny details now have to be precisely correct in a way they didn't before.

Like, if a model has some weird gap where it knows how to make an arm at 45 degrees and 60 degrees, but not 47, then that's fine for from-scratch generation. It'll just make one like it knows how (or more precisely, like it models as naturally likely). Same with any other weird quirks of what it thinks is good (naturally likely): It can just adjust to something that still matches the semantics but fits into the model's quirks. No such luck when now you need to get details like "47 degrees" correct. It's just a little harder without some training or modeling insight into how an arm at 45 degrees and 47 degrees are really "basically the same" (or just that much more data, so that you lose the weird bumps in the likelihood).

I wouldn't be surprised if "just that much more data" ends up being the answer, given the volume of video data on the internet, and the wide applicability of video generation (and hence intense research in the area).
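
A toy version of the 45-versus-47-degree point, purely as an illustration. The "model" here is an assumption: it can only render the angles in a hypothetical `KNOWN_ARM_ANGLES` list and snaps everything else to the nearest one, which is a crude stand-in for bumps in the learned likelihood.

```python
# Hypothetical: the only arm angles this toy model renders well.
KNOWN_ARM_ANGLES = [45.0, 60.0]


def snap_to_known(angle: float) -> float:
    """Render the requested angle as the nearest angle the model 'knows'."""
    return min(KNOWN_ARM_ANGLES, key=lambda known: abs(known - angle))


def inbetween_targets(start: float, end: float, n: int) -> list:
    """Angles an in-betweener must hit: n evenly spaced steps between two keys."""
    step = (end - start) / (n + 1)
    return [start + step * i for i in range(1, n + 1)]


# From scratch, any known angle is acceptable, so snapping costs nothing.
# In-betweening from 45 to 51 degrees needs exactly 47 and 49.
targets = inbetween_targets(45.0, 51.0, 2)
rendered = [snap_to_known(t) for t in targets]
errors = [abs(r - t) for r, t in zip(rendered, targets)]  # degrees off per frame
```

Both in-betweens snap back to 45 degrees here, so the arm freezes and then jumps instead of moving smoothly, even though each frame on its own looks fine.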

It's counterintuitive, but less so when you consider that it's also way easier for a human to draw something from scratch than to inbetween two key frames!

(I guess we're used to machines and people struggling at opposite things so this is counter counter intuitive, or something...)

Animation key frames are not interchangeable with inbetween frames, since key frames try to show as many body parts as possible in "extreme" positions, though that's not always possible for every part because of so-called overlapping action. This is not to say you can't generate plausible "extremes" from inbetweens; acting-wise, key frames definitely carry the most weight.

AI being good at stills is true, though it takes a lot of prompting and cherry picking quite often; most results I get out of naively prompting the most famous models are outright terrifying.

Animation has a much lower frame rate than live video, motion can be extremely exaggerated, and the underlying shape can depend on the view, i.e. be non-Euclidean. Additionally, there are fewer high-frequency features (think leopard spots) that can serve as cues about how the global shape (the leopard's outline) moves. And of course things are drawn by humans, not captured by cameras, which means animation errors will be pervasive throughout the training data.

These things combined mean less information to learn a more difficult world model.

I only scrolled through the article, reading snippets and looking at pictures, but the pictures of yoga moves were what made me think "this is hard". Specifically, interpolating between a leg that's visible and extended and a leg that is obscured behind other limbs... it will be impressive/magical when the AI correctly distinguishes between possibilities like "this thing should fade/vanish" and "this thing should fold and move behind/be obscured by other parts of the image".

Same, I would have thought that edge detection would have been among the first problems to get solved!