Hacker News
by null_shift 1382 days ago
True video capability would entail describing a scene as a prompt and getting a video in return. Not interpolating between a handful of images as is being done now (not to discredit those).

This will be a huge game changer when it occurs. Whether it be for deep fake videos, creating custom content, or making a new season of your favorite tv show that was cancelled too early. The possibilities are endless.

This is probably not in the near future (i.e. this year), but I doubt it is very far off.

6 comments

I am much more interested in an intermediary step. I would love to be able to use a tool like this to create a comic book. This is after all just static artwork which the tool already creates quite beautifully.

What it would need to get from here to there is an understanding of some concepts, the first being "characters". On Reddit there was a beautiful image that recently won first place in an art contest, and it has quite frustrated some of the art community. When I was looking at it I thought it was awesome, but I wondered about the ability to create another hundred or so images in the same 'world' that the image was showing. I would want to give it a prompt like "tired old medieval knight with a mace and shield", have it create the character, name it "Tom" or something, and then feed it more prompts for that character, like "Tom is sitting in a forest brooding", and have it create the exact same character in a different context.

That would be pretty game changing for opening up amateur web comics to a large body of people who have ideas and stories to tell but no art skills to speak of - my stick characters are crooked :(

> I would love to be able to use a tool like this to create a comic book.

Last week PhilFTW explained "How To Create a Complete Graphic Novel in ONE Day" with Midjourney in a YouTube video [1]. He uses five tools:

- Midjourney (to generate images)

- InferKit (to generate the story text)

- Word (to rearrange the story text to fit into some narrative)

- Comic Life 3 for iPad (to place the images and text in comic book panels)

- Affinity Designer (to design the cover and export everything to print, Kindle, and Blurb)

[1] https://youtu.be/tjj6KsPSHZc

The result is bad though, for the same reason you can't generate video with it. Comic panels need to relate to each other; you can't simply make them out of random images. There aren't sufficient style controls to do that with current technology, even if Midjourney added in "textual inversion".
There is some work exploring that with Textual Inversion[1].

Another trick for approaching this problem is specifying the random seed, which causes the same prompt to generate the same image without any randomness. When you then change the prompt, you get an image that is very similar to the first one, but with the variation included. Somebody used that to age a woman across 100 years[2] with quite stunning results. It even works with gender or style changes.

[1] https://textual-inversion.github.io/

[2] https://www.reddit.com/r/StableDiffusion/comments/wq6t5z/por...
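
A toy sketch of why the fixed-seed trick works, in plain Python. This is not Stable Diffusion: `initial_noise`, `prompt_vector`, and `toy_generate` are hypothetical stand-ins made up for illustration, and the "image" is just a short list of numbers. The only point is that with the seed fixed the starting noise is identical, so any difference in the output comes from the prompt change alone.

```python
import hashlib
import random


def initial_noise(seed, size=8):
    """Starting noise; the same seed always yields the same values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def prompt_vector(prompt, size=8):
    """Hypothetical stand-in for a text encoder: hash each word into a bucket."""
    vec = [0.0] * size
    for word in prompt.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % size] += 1.0
    return vec


def toy_generate(prompt, seed, size=8):
    """Toy 'image': seeded noise nudged by the prompt (assumed model, not real SD)."""
    noise = initial_noise(seed, size)
    guide = prompt_vector(prompt, size)
    return [n + g for n, g in zip(noise, guide)]


def distance(a, b):
    """Euclidean distance between two toy images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


if __name__ == "__main__":
    young = toy_generate("portrait of a young woman", seed=1234)
    old = toy_generate("portrait of an old woman", seed=1234)
    reseeded = toy_generate("portrait of an old woman", seed=99)
    print("same seed, edited prompt:", distance(young, old))
    print("different seed, same prompt:", distance(old, reseeded))
```

In a real pipeline the seed would be passed to the sampler's random number generator instead, and the prompt influences the result in a far less linear way, so treat this as intuition only.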


I recently saw a Twitter thread from last year where someone made a comic book with AI generated backgrounds. The characters were added in later, but it stuck with me as a very cool future use case

https://twitter.com/ursulav/status/1467652391059214337

Imagine how fun it would be to sit at a terminal in vim, editing a 100-line 'script' for a short movie and getting rapid feedback. I'm so excited about the future.
How about “hey Siri, play LOTR replacing every character with Nicolas Cage”
The possibilities are endless. "Insert Willy Wonka as Frodo's love interest, and Willy should join the major battles with Uzi machine guns, and his dialogue should be as if he is an inner-city gang member."
"NOT THE ~BEES~ NAZGUL!"
Play 2001: A Space Odyssey, make it a tight 90 minutes, directed by Michael Bay.
Feedback will likely not be rapid.

It will take a lot of compute to compile the script and render the video.

I thought we'd never get image generation this fast. Last year it was 30 minutes per image. The Stable Diffusion folks are planning a 100 MB release of the image generator in Q1, which would surely be real time. I actually suspect you can get something like that incredibly fast (even though all intuition says otherwise).
The article shows a model that does this.

It's only a few frames, but they are entirely generated from text - no seed image or interpolation required.

What is being referred to as "interpolation" here? As an outsider, isn't Stable Diffusion "interpolating text into images/frames/video" in a literal (if not technical) sense?
It's to be interpreted in the quasi-mathematical sense where you have images for frame A and frame B representing your data points. To interpolate between those frames, a flow of plausible images simulating the transition from A to B is generated.
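
A minimal sketch of that quasi-mathematical sense, assuming frames are flat lists of pixel intensities. `interpolate_frames` is a hypothetical helper, and it does the most naive thing possible: a linear crossfade between A and B. Real frame-interpolation models try to generate plausible intermediate images rather than blends, but the setup (two known endpoints, generated in-betweens) is the same.

```python
def interpolate_frames(frame_a, frame_b, steps):
    """Return `steps` in-between frames blending linearly from frame_a to frame_b.

    Frames are flat lists of numbers (pixel intensities); the endpoints
    themselves are not included in the result.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")

    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # fraction of the way from A to B
        frames.append([(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)])
    return frames


if __name__ == "__main__":
    frame_a = [0.0, 10.0]
    frame_b = [4.0, 2.0]
    for frame in interpolate_frames(frame_a, frame_b, steps=3):
        print(frame)
```

Everything in the output is pinned between the two given frames, which is why this kind of approach cannot invent a cut, a new scene, or anything not already implied by A and B.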
Interpolation here meaning one smooth motion transition is all that is depicted. An entire episode of television requires things like cuts between scenes, possibly discontinuities like flashbacks, scenes that take place days, months, or even decades later, and characters should still look the same, but might be wearing different clothing, or grow a beard, or get really old but still have similar facial features and the same skin color. If one ages, they should all age about the same, unless it's a story with time travel or humanoid immortal characters that don't age.

I'm sure these types of capabilities will come at some point, but no current model can do it. It requires more than just projecting motion into a scene.

You could "hack it" by using a couple of other models as part of your pipeline, similar to how you sometimes have to use a GAN after SD to "fix" faces.

You could also put a language model on top of your prompting system, so that "Gandalf kicking ass" gets translated into "Page XXX, Paragraph XX from LOTR".

CogVideo generates a video from a prompt, but you can also use an image as a starting point.