by zdyn5 719 days ago
I know it’s probably using less than 1/1000th of the real Sora's compute, but “pretty good” is stretching it

Depends on your frame of reference. Compared to anything else I've seen generated on a consumer grade GPU, I'd say these are indeed pretty good.

Here's their example gallery: https://hpcaitech.github.io/Open-Sora/

Compared to the outputs from other models run on consumer grade GPUs, I'd say those are very good.

Looking at it in low res (small rectangle within the webpage on mobile) they actually look great!
What's the useful frame of reference?

Looking better than other things that are also bad is sort of interesting in that it represents progress in some direction, but it isn't very interesting to people outside of the topic.

For me I use the Will Smith video[0] from just over a year ago. Compared to the examples it's a pretty stark difference.

[0] https://arstechnica.com/information-technology/2023/03/yes-v...

Yes, people learned not to generate other people eating. Current SOTA models still have no concept of walking (left leg, right leg, left leg, right leg; it's so complicated?), so there is no reason to believe that they have learned the peculiarities of food consumption.
We seem to be in an exponential uptick phase of tech driven by hardware improvements; a few years ago this was impossible on a consumer-grade GPU. So in some sense there isn't a useful frame of reference: the state of the art should improve out of sight about every 2 years, and eventually I'd expect iPhones to be outgenerating Disney at movies.
Not GP, but when I looked at the examples, I thought that those already look pretty useable in comic book-like storytelling to set the mood. I.e. in settings where smaller details of the scene are not relevant and are not taking away from the "larger product".
Good that this frame of reference is hn and not some random website where people have no connection to ml...
Just run all key frames through stable diffusion and it should be quite good.
I don't think there's such a thing as key frames here, just frames. And if you run SD through every frame, the output will be janky because SD doesn't know about temporal coherence.
As soon as it's an MP4 it will have key frames all right. You could add AI upscaling to your encoder. People are making fun of "Just", but I believe I could take apart ffmpeg to add this feature (PoC) in two weeks or less. Provided somebody pays for my labor and for the HW.
Honestly neither does OpenSora it seems, as it is pretty damn janky already.
You can pass a few frames as a single image grid. Then you will get coherence, although it will be very limited by gpu ram.
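
A rough sketch of the grid idea, assuming each frame is a plain 2D list of pixel values and all frames share the same size. The names `frames_to_grid` and `grid_to_frames` are made up for illustration, and the diffusion pass over the grid is not shown; this only covers the tiling and un-tiling around it.

```python
def frames_to_grid(frames, cols):
    """Tile equally sized frames into one image, left to right, top to bottom."""
    h, w = len(frames[0]), len(frames[0][0])
    blank = [[0] * w for _ in range(h)]          # filler for an incomplete last row
    rows = -(-len(frames) // cols)               # ceiling division
    padded = list(frames) + [blank] * (rows * cols - len(frames))
    grid = []
    for r in range(rows):
        tiles = padded[r * cols:(r + 1) * cols]
        for y in range(h):
            line = []
            for tile in tiles:
                line.extend(tile[y])             # stitch row y of each tile side by side
            grid.append(line)
    return grid


def grid_to_frames(grid, frame_h, frame_w, count):
    """Cut the (processed) grid back into `count` frames of frame_h x frame_w."""
    cols = len(grid[0]) // frame_w
    frames = []
    for i in range(count):
        top, left = (i // cols) * frame_h, (i % cols) * frame_w
        frames.append([row[left:left + frame_w] for row in grid[top:top + frame_h]])
    return frames
```

The grid's pixel count grows with the number of frames, which is where the GPU RAM limit bites: a 2x2 grid of 512x512 frames is already a 1024x1024 image.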
>Just

Adding the word 'just' doesn't make it any easier. Something I've noticed is that people who have never done something themselves and are telling someone to do a difficult task will use:

"Just"

in front of it.

This is particularly relevant in tech.

Depends if it's already been done before, in which case "just" would then have been just used quite justly.
It's very likely the comment you replied to said it in a joking sense
It is extremely difficult to tell if the person is joking in a field full of people who think AI is some sort of magic.
Well I mean calling any of this diffusion/LLM stuff "AI" is a misnomer to begin with.