Real artists take comic book scripts and turn them into actual comic books every month. They may not match exactly what the writer had in mind, but they are fit for purpose.
I haven't used Sora, but none of the GenAI I'm aware of could produce a competent comic book. When a human artist draws a character in a house in panel 1, they'll draw the same house in panel 2, not a procedurally generated different house for each image.
If a 60-year-old grizzled detective is introduced on page 1, a human artist will draw the same grizzled detective on pages 2, 3 and so on, not procedurally generate a new grizzled detective each time.
A human artist keeps state :). They keep it between drawing sessions, and more importantly, they keep very detailed state - their imagination or interpretation of what the thing (house, grizzled detective, etc.) is.
Most models people currently use don't keep state between invocations, and whatever interpretation they make from provided context (e.g. reference image, previous frame) is surface level and doesn't translate well to output. This is akin to giving each panel in a comic to a different artist, and also telling them to sketch it out by their gut, without any deep analysis of prior work. It's a big limitation, alright, but researchers and practitioners are actively working to overcome it.
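To make "keeping state" concrete, here is a minimal sketch of a store that pins a character's description and seed the first time the character appears, then reuses them for every later panel. All names (`CharacterSheet`, `CharacterStore`, `panel_prompt`) are hypothetical, and reusing a seed and description does not by itself guarantee the same figure from a real diffusion model; it only shows where the state would live.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterSheet:
    # Hypothetical record of what the "artist" decided about a character.
    name: str
    description: str
    seed: int


class CharacterStore:
    """Keeps character state between invocations (in memory for this sketch)."""

    def __init__(self):
        self._sheets = {}

    def get_or_create(self, name, description):
        # The first description wins; later calls reuse the stored sheet.
        if name not in self._sheets:
            digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
            seed = int(digest[:8], 16)  # stable seed derived from the name
            self._sheets[name] = CharacterSheet(name, description, seed)
        return self._sheets[name]


def panel_prompt(store, name, description, action):
    # Build the per-panel request from stored state, not from the new description.
    sheet = store.get_or_create(name, description)
    return {"prompt": f"{sheet.description}, {action}", "seed": sheet.seed}


store = CharacterStore()
panel_1 = panel_prompt(store, "detective",
                       "60-year-old grizzled detective, grey stubble, tan trench coat",
                       "entering a house")
panel_2 = panel_prompt(store, "detective", "an old detective", "examining a desk")
```

Here `panel_2` ignores the vaguer second description and reuses the sheet created for `panel_1`, which is the behaviour a stateless model invocation lacks.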
Btw there’s a way to match characters in a batch in the forge webUI which guarantees that all images in the batch have the same figure in them. Trivial to implement this in all other image generators. This critique is baseless.
So prove it. If you are arguing in good faith that an AI, via automation, can draw a comic script with consistent figures, please tell an AI to draw the images in the first 3 pages of this script I pulled from the comic book script archive:
As long as you're not asking for a zero-shot solution with a single model run three times in a row, this should be entirely doable, though I imagine getting a reliable result would require a complex pipeline consisting of:
- An LLM to inflate descriptions in the script to very detailed prompts (equivalent to artist thinking up how characters will look, how the scene is organized);
- A step to generate a representative drawing of every character via txt2img - or more likely, multiple ones, with a multimodal LLM rating adherence to the prompt;
- A step to generate a lot of variations of every character in different poses, using e.g. ControlNet or whatever is currently the SOTA solution used by the Stable Diffusion community to create consistent variations of a character;
- A step to bake all those character variations into a LoRA;
- Finally, scenes would be generated by another call to txt2img, with prompts computed in step 1, and appropriate LoRAs active (this can be handled through prompt too).
Then iterate on that, e.g. maybe additional img2img to force comic book style (with a different SD derivative, most likely), etc.
Point being, every subproblem of the task has many different solutions already developed, with new ones appearing every month - all that's left to have an "AI artist" capable of solving your challenge is to wire the building blocks up. For that, you need just a trivial bit of Python code using existing libraries (e.g. hooking up to ComfyUI), and guess what, GPT-4 and Claude 3.5 Sonnet are quite good at Python.
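For illustration, the wiring could look roughly like the sketch below. Every function here is a hypothetical stand-in that just returns strings: no real LLM, txt2img, ControlNet, LoRA training or ComfyUI call is made, and the names are mine, not any library's API. It only shows the data flow between the steps listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    name: str
    detailed_prompt: str = ""
    reference: str = ""
    variations: list = field(default_factory=list)
    lora: str = ""


def expand_description(text):
    # Stand-in for the LLM that inflates script text into a detailed prompt.
    return f"detailed({text})"


def rate_adherence(image, prompt):
    # Stand-in for a multimodal LLM score; this placeholder prefers candidate #0.
    return -int(image.rsplit("#", 1)[1])


def generate_reference(prompt, candidates=3):
    # Stand-in for txt2img: make several candidates, keep the best-rated one.
    images = [f"img({prompt})#{i}" for i in range(candidates)]
    return max(images, key=lambda img: rate_adherence(img, prompt))


def generate_variations(reference, poses):
    # Stand-in for pose-conditioned variations of the reference drawing.
    return [f"{reference}@{pose}" for pose in poses]


def train_lora(name, variations):
    # Stand-in for baking the variations into a per-character LoRA.
    return f"lora:{name}:{len(variations)}"


def render_panel(scene_prompt, loras):
    # Stand-in for the final txt2img call with the character LoRAs active.
    return f"panel[{scene_prompt} | {', '.join(sorted(loras))}]"


def draw_script(characters, panels, poses=("front", "side", "seated")):
    cast = {}
    for name, description in characters.items():
        c = Character(name)
        c.detailed_prompt = expand_description(description)
        c.reference = generate_reference(c.detailed_prompt)
        c.variations = generate_variations(c.reference, poses)
        c.lora = train_lora(name, c.variations)
        cast[name] = c
    # Every panel featuring a character reuses that character's single LoRA.
    return [render_panel(expand_description(scene), [cast[n].lora for n in names])
            for scene, names in panels]


pages = draw_script(
    {"detective": "60 year old grizzled detective"},
    [("detective enters the house", ["detective"]),
     ("detective inspects the desk", ["detective"])],
)
```

The point of the structure is that character identity is computed once and then referenced by every panel, rather than being re-derived from the prompt each time.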
EDIT: I asked Claude to generate a "pseudocode" diagram of the solution from our two comments:
I work with professional artists all the time and this is not the case. They're generally quite good at extrapolating from a couple paragraphs into something fantastic, often exactly what I had in mind.
In comparison I've messed around with prompting image generator models quite a bit and it's not possible to get remotely close to the quality level of even rough paid concept work by a professional, and the credits to run these models aren't particularly cheap.
With real art you can start from somewhere and keep building on that foundation. Say you pick an angle to shoot from and test different actors and scenes from that angle. With AI you’re re-rolling the dice for every iteration. If you’re happy that it looks 80% correct then sure it’s maybe passable.
I think people are getting way ahead of their skis here. Even in 2D I can’t, for example, generate inventory images for weapons and items for a game yet, which is a test case orders of magnitude simpler than video. They all come out in slightly different styles. If I don’t care that they all look different in strange ways then it’s useful - but any consumer will think it looks like crap.
There is no problem unless you insist on reflecting what you had in mind exactly. That needs minute controls, but no matter the medium and tools you use, unless you're doing it in your own quest for artistic perfection, the economic constraints will make you stop short of your idea - there's always a point past which any further refinement will not make a difference to the audience (which doesn't have access to the thing in your head to use as reference), and the costs of continuing will exceed any value (monetary or otherwise) you expect to get from the work.
AI or not, no one but you cares about the lower order bits of your idea.
Nobody else really cares about the lower order bits of the idea, but they do care that those lower order bits are consistent. The simplest example is color grading: most viewers are generally ignorant of artistic choices in color palettes unless it’s noticeable, like the Netflix blue tint, but a movie whose scenes haven’t been color graded consistently is obviously jarring, and even an expensive production can come off as amateurish.
GenAI is great at filling in those lower order bits, but until stuff like ControlNet gets much better precision and UX, I think genAI will be stuck in the uncanny valley because its output is inconsistent between scenes, frames, etc.
Yup, 100% agreed on that, and mentioned this caveat elsewhere. As you say - people don't pay attention to details (or lack of them), as long as the details are consistent. Inconsistencies stand out like sore thumbs. Which is why IMO it's better to have fewer details than to be inconsistent with them.
> There is no problem unless you insist on reflecting what you had in mind exactly.
Not disagreeing, just noting: this is not how [most?] people's minds work {I don't think you're holding to that opinion particularly, I'm just reflecting on this point}. We have vague ideas until an implementation is shown, then we examine it and latch on to a detail and decide if it matches our idea or not. For me, if I'm imagining "a superhero planting vegetables in his garden" I've no idea what they're actually wearing, but when an artist or genAI shows me it's a brown coat then I'll say "no, something more Marvel". Then when ultimately they show me something that matches the idea I had _and_ matches my current conception of the idea I had... then I'll point out the fingernails are too long, when in the idea I hadn't even perceived the person had fingers, never mind too-long fingernails!
I'd warrant any actualised artistic work has some delta with the artist's current perception of the work, and a larger delta with their initial perception of it.
I disagree. Even without exactness, adding any reasonable constraints is impossible. Ask it to generate a realistic circuit diagram or chess board or any other thing where precision matters. Good luck going back and forth getting it right.
These are situations with relatively simple logical constraints, but an infinite number of valid solutions.
Keep in mind that we are not requiring any particular configuration of circuit diagram, just any diagram that makes sense. There are an infinite number of valid ones.
That's using the wrong tool for the job :). Asking diffusion models to give you a valid circuit diagram is like asking a painter to paint you a pixel-perfect 300 DPI image on a regular canvas, using their standard paintbrush. It ain't gonna work.
That doesn't mean it can't work with AI - it's that you may need to add something extra to the generative pipeline, something that can do circuit diagrams, and make the diffusion model supply style and extra noise (er, beautifying elements).
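As a rough illustration of that "something extra", the sketch below builds the circuit structure deterministically and checks it with a simple rule before anything is handed to an image model. The function names and the validity rule (every node touched by at least two component terminals, no component shorted to itself) are my own simplifying assumptions; the styling step is not shown.

```python
import random


def series_loop(components, rng):
    """Chain a voltage source and the given components into a single closed loop."""
    parts = ["V1"] + rng.sample(components, k=len(components))
    n = len(parts)
    # Part i sits between node i and node (i + 1) mod n, so the last part
    # closes the loop back to node 0.
    return [(part, i, (i + 1) % n) for i, part in enumerate(parts)]


def is_valid(netlist):
    """Simplified check: no self-shorted part and no dangling node."""
    degree = {}
    for _, a, b in netlist:
        if a == b:
            return False
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return bool(netlist) and all(d >= 2 for d in degree.values())


netlist = series_loop(["R1", "C1", "LED1"], random.Random(0))
```

Only a netlist that passes `is_valid` would be rendered as a schematic and then passed to the diffusion step for style, so the model never gets the chance to invent the wiring.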
> Keep in mind that we are not requiring any particular configuration of circuit diagram, just any diagram that makes sense. There are an infinite number of valid ones.
On that note. I'm the kind of person that loves to freeze-frame movies to look at markings, labels, and computer screens, and one thing I learned is that humans fail at this task too. Most of the time the problems are big and obvious, ruining my suspension of disbelief, and importantly, they could be trivially solved if the producers grabbed a random STEM-interested intern and asked for advice. Alas, it seems they don't care.
This is just a specific instance of the general problem of "whatever you work with or are interested in, you'll see movies keep getting it wrong". Most of the time, it's somewhat defensible - e.g. most movies get guns wrong, but in a way people are used to, and one that makes the scenes more streamlined and entertaining. But with labels, markings and computer screens, doing it right isn't any more expensive, nor would it make the movie any less entertaining. It seems that the people responsible don't know better or don't care.
Let's keep that in mind when comparing AI output to the "real deal", so as not to set impossible standards that human productions don't match, and never did.
The issue isn’t any particular constraint. The issue is the inability to add any constraints at all.
In particular, internal consistency is one of the important constraints which viewers will immediately notice. If you’re just using Sora for unrelated 5-second videos it may be less of an issue, but if you want to do anything interesting you’ll need the clips to tie together, which requires internal consistency.