Looking better than other things that are also bad is sort interesting in that it represents progress in some direction, but it isn't very interesting to people outside of the topic.
Yes, people learned not to generate other people eating. Current SOTA models still have no concept of walking (left leg, right leg, left leg, right leg; it's so complicated?), there is no reason to believe that they have learned the peculiarities of food consumption.
We seem to be in an exponential uptick phase of tech driven by hardware improvements; a few years ago this was impossible on consumer grade GPU. So in some sense there isn't a useful frame of reference, state of the art should improve out-of-sight about every 2 years and eventually I'd expect iPhones to be outgenerating Disney at movies.
Not GP, but when I looked at the examples, I thought that those already look pretty useable in comic book-like storytelling to set the mood. I.e. in settings where smaller details of the scene are not relevant and are not taking away from the "larger product".
I don't think there's such a thing as key frames here, just frames. And if you run SD through every frame, the output will be janky because SD doesn't know about temporal coherence.
As soon as it's an MP4 it will have key frames all right. You could add AI upscaling to your encoder. People are making fun of "Just", but I believe I could take apart ffmpeg to add this feature (PoC) in two weeks or less. Provide somebody pays for my labor and for the HW.
Adding the word 'just' doesnt make it any easier. Something I've noticed is that people who have never done something themselves and are telling someone to do an difficult task, will use:
Here's their example gallery: https://hpcaitech.github.io/Open-Sora/
Compared to the outputs from other models run on consumer grade GPUs, I'd say those are very good.