| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by 6gvONxR4sf7o 796 days ago

> I think this should be _easier_ than producing a realistic image from scratch

Think of this in terms of constraints. An image from scratch has self consistency constraints (this part of the image has to be consistent with that part) and it may have semantic constraints (if it has to match a prompt). An animation also has the self consistency constraints, but also has to be consistent with other entire images! The fact that the images are close in some semantic space helps, but all the tiny details become so important to get precisely correct in a new way.

Like, if a model has some weird gap where it knows how to make an arm at 45 degrees and 60 degrees, but not 47, then that's fine for from-scratch generation. It'll just make one like it knows how (or more precisely, like it models as naturally likely). Same with any other weird quirks of what it thinks is good (naturally likely): It can just adjust to something that still matches the semantics but fits into the model's quirks. No such luck when now you need to get details like "47 degrees" correct. It's just a little harder without some training or modeling insight into how an arm at 45 degrees and 47 degrees are really "basically the same" (or just that much more data, so that you lose the weird bumps in the likelihood).

I wouldn't be surprised if "just that much more data" ends up being the answer, given the volume of video data on the internet, and the wide applicability of video generation (and hence intense research in the area).