

by jerf 562 days ago

It just plain isn't possible if you mean a prompt the size of what most people have been using lately, in the couple hundred character range. By sheer information theory, the number of possible interpretations of "a zoom in on a happy dog catching a frisbee" means that you can not match a particular clip out of the set with just that much text. You will need vastly more content; information about the breed, information about the frisbee, information about the background, information about timing, information about framing, information about lighting, and so on and so forth. Right now the AIs can't do that, which is to say, even if you sit there and type a prompt containing all that information, it is going to be forced to ignore most of the result. Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.
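To put rough numbers on the information-theory point, here is a back-of-the-envelope sketch. The figures are assumptions, not from the comment: roughly 1.3 bits of information per character of English (a commonly cited Shannon-style estimate) and a 5-second clip compressed at about 5 Mbit/s.

```python
# Assumed figures, for illustration only.
BITS_PER_CHAR = 1.3          # rough entropy estimate for English text
PROMPT_CHARS = 200           # "couple hundred character range"
CLIP_SECONDS = 5             # assumed clip length
CLIP_BITRATE_BPS = 5_000_000 # assumed compressed video bitrate


def prompt_bits(chars: int, bits_per_char: float = BITS_PER_CHAR) -> float:
    """Approximate information content of a text prompt, in bits."""
    return chars * bits_per_char


def clip_bits(seconds: float, bitrate_bps: float) -> float:
    """Approximate information content of a compressed clip, in bits."""
    return seconds * bitrate_bps


p = prompt_bits(PROMPT_CHARS)
c = clip_bits(CLIP_SECONDS, CLIP_BITRATE_BPS)
ratio = c / p  # how many times more bits the clip carries than the prompt

print(f"prompt: ~{p:.0f} bits, clip: ~{c:.0f} bits, ratio: ~{ratio:,.0f}x")
```

Under these assumptions the clip carries tens of thousands of times more bits than the prompt, so the model has to invent nearly everything the prompt leaves unsaid.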

This isn't a matter of human-level AI or superhuman-level AI; it's just straight up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that will make the scene work, but expecting it to fill in the gaps the way you "want" even though you gave it no indication of what that is is expecting literal magic.

Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie starring a precocious 22-year-old elf in a far-future Ganymede colony with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to be doing a lot of some sort of internal communication in an internal language to hold the result together between scenes, because what it takes to hold stuff coherent between scenes is an amount of English text not entirely dissimilar in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.

11 comments

LASR 562 days ago

What you are saying is totally correct.

And this applies to language / code outputs as well.

The number of times I’ve had engineers at my company type out 5 sentences and then expect a complete react webapp.

But what I’ve found in practice is using LLMs to generate the prompt with low-effort human input (eg: thumbs up/down, multiple-choice etc) is quite useful. It generates walls of text, but with metaprompting, that’s kind of the point. With this, I’ve definitely been able to get high ROI out of LLMs. I suspect the same would work for vision output.

kurthr 562 days ago

I'm not sure, but I think you're saying what I'm thinking.

Stick the video you want to replicate into o1 and ask for a descriptive prompt to generate a video with the same style and content. Take that prompt and put it into Sora. Iterate with human and o1-generated critical responses.

I suspect you can get close pretty quickly, but I don't know the cost. I'm also suspicious that they might have put in "safeguards" to prevent some high profile/embarrassing rip-offs.
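The describe, generate, critique loop outlined above could be sketched like this. It is a minimal sketch: `describe`, `generate`, and `critique` are hypothetical stand-ins for calls to a language model, a video model, and a reviewer (human or model). None of them is a real API.

```python
from typing import Callable


def refine_prompt(
    describe: Callable[[str], str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    target: str,
    rounds: int = 3,
) -> str:
    """Iteratively extend a prompt using feedback on each generated candidate."""
    prompt = describe(target)                   # initial descriptive prompt
    for _ in range(rounds):
        candidate = generate(prompt)            # produce a candidate video
        feedback = critique(target, candidate)  # critical response to the candidate
        if not feedback:                        # empty feedback means "good enough"
            break
        prompt = f"{prompt}\nRevision note: {feedback}"
    return prompt


# Toy stand-ins so the loop runs without any model behind it.
notes = iter(["make the dog a border collie", ""])
final = refine_prompt(
    describe=lambda video: f"A clip resembling {video}",
    generate=lambda prompt: f"<video for: {prompt}>",
    critique=lambda target, candidate: next(notes),
    target="dog_frisbee.mp4",
)
print(final)
```

The prompt only grows in this sketch. A real loop would probably have the language model rewrite the whole prompt each round, and would need a cost cap.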

robotresearcher 562 days ago

> Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible.

Why snippets? Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence.

coffeebeqn 562 days ago

This almost certainly won’t work. Feel free to feed in any of the hundreds of existing film scripts and test how coherent the models can be. My guess is not at all.

robotresearcher 562 days ago

The clips on the Sora site today would have been utterly astonishing ten years ago. Long term progress can be surprising.

dragonwriter 562 days ago

> The clips on the Sora site today would have been utterly astonishing ten years ago.

Yeah, and Apollo 11 would have been utterly astonishing a decade before it occurred. And, yet, if you tried to project out from it to what further frontiers manned spaceflight would reach in the following decades, you’d…probably grossly overestimate what actually occurred.

> Long term progress can be surprising.

Sure, it can be surprising for optimists as well as naysayers; as a good rule of thumb, every curve that looks exponential in an early phase ends up being at best logistic.
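A small numeric sketch of that rule of thumb: a logistic curve is nearly indistinguishable from an exponential well before its midpoint, then flattens at its ceiling. The growth rate, midpoint, and ceiling below are arbitrary illustrative values.

```python
import math


def exponential(t: float, k: float = 1.0, t0: float = 10.0, cap: float = 1.0) -> float:
    """Unbounded exponential growth, scaled to match the logistic's early phase."""
    return cap * math.exp(k * (t - t0))


def logistic(t: float, k: float = 1.0, t0: float = 10.0, cap: float = 1.0) -> float:
    """Logistic growth with ceiling `cap` and midpoint `t0`."""
    return cap / (1.0 + math.exp(-k * (t - t0)))


for t in (0, 5, 10, 15, 20):
    print(f"t={t:>2}  exponential={exponential(t):12.5f}  logistic={logistic(t):.5f}")
```

Early on the two columns agree closely. Past the midpoint the exponential keeps growing while the logistic approaches its ceiling, which is why extrapolating from the early phase alone is unreliable.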

hatefulmoron 562 days ago

In the long run we are all dead. Saying that technology will be better in the future is almost eye-roll worthy. The real task is predicting what future technology will be, and when it will arrive.

Ask anyone with a chronic illness about the future and they'll tell you we're about 5 years off a cure. They've been saying that for decades. Who knows where the future advancements will be.

bergen 562 days ago

https://xkcd.com/605/

sleepybrett 562 days ago

This will almost certainly be in theaters within 5 years, probably first as a small experimental project (think blair witch).

runarberg 562 days ago

The Blair Witch Project was a (surprise) creative masterpiece. It worked with very limited technology to create a very clever plot, which was paired with amazing marketing. The world hadn’t seen that combination before. It took some creative geniuses to piece the Blair Witch Project together.

Generative AI will never produce an experience like that. I know never is a long time, but I’m still gonna call it. You simply can’t produce such a fresh idea by gathering a bunch of data and interpolating.

Maybe someday AI will be good enough to create shorter or longer videos with some dialog and even a coherent story (though I doubt it), but it won’t be fresh or creative. And we humans will at best enjoy it for its stupidity or sloppiness. Not for its cleverness or artistry.

dumbfounder 562 days ago

Why does the idea need to be generated by AI? Let people generate the ideas, the AI will help execute. I think soon (3-5 years) a determined person with no video skills will be able to put together a compelling movie (maybe a short). And that is massive. AI doesn’t have to do everything. Like all tech, it’s a productivity tool.

krainboltgreene 562 days ago

> Why does the idea need to be generated by AI?

This is the at-first-fun-but-now-frustrating infinite goal move. "AI (a stand in for literally anything) will do (anything) soon." -> "It won't do (thing), it's too complex." -> "Who said AI will do (thing)?"

Breza 558 days ago

I'm suspicious of most claims of AI growth, but I think screenwriting is an area where there's real potential. There are many screenplays out there, many movie plots are very similar to each other, and human raters could help with training. And it's worth noting that the top four highest grossing movies right now are all sequels or film adaptations. It's not a huge leap to imagine an LLM in the future that's been trained on movie writing being able to create a movie script when given the Wicked musical. https://www.imdb.com/chart/boxoffice/

runarberg 558 days ago

The 2023 Writers Guild of America strike was in part to prevent screenplays being written entirely by generative AI.

So no, I don’t think this will happen either. Authors may use AI themselves as one tool in their toolbox as they write their scripts, but we will not see entire production screenplays written by generative AI and set for theatrical release. The industry will simply not allow that to happen. At most you can have AI write a screenplay for your own amusement, not for publication.

sleepybrett 560 days ago

I'm thinking more of a Gibsonian 'Garage Kubrick'. A solitary auteur (or small team) that produces the film alone perhaps without even touching a camera, generating all the footage using AI (in the novel the auteur creates all the footage through photo/found-footage manipulation, or at least that's all we see in text). The script will probably be human written, I'm not talking about an AI producing a film from scratch, rather a film being produced using AI to create all the visuals and audio.

runarberg 560 days ago

That is a far more reasonable prediction, but I don’t even see this future. This kind of “film making” will at best be something generated for the amusement of the creator (think, give me a specific episode of Star Trek where Picard ...) or serve as prototypes or concepts for films yet to be shot with actual actors. And it certainly won’t be in theaters, not in 5 years, or ever.

Generative AI will not be able to approach the artistry of your average actor (not even a bad actor), and it won’t be able to match the lighting or the score to the mood (unless you carefully craft that in your prompt). It won’t get creative with the camera angles (again unless you specifically prompt for a specific angle) or the cuts. And it probably won’t stay consistent with any of these, or otherwise break the consistency at the right moments, like an artist could.

If you manage to prompt the generative AI to create a full feature film with excellent acting, the correct lighting given the mood, a consistent tone with editing to match, etc., you have probably spent much more time and money on crafting the prompt than would otherwise have gone into simply hiring the crew to create your movie. The AI movie will certainly contain slop and be so visibly bad that it is guaranteed not to be in theaters.

Now if you hired that crew to make the movie instead, that crew might use AI as a tool to enhance their artistry, but you still need your specialized artists to use that tool correctly. That movie might make it to the theaters.

SamPatt 562 days ago

It's a tool. The cleverness and artistry comes from the humans, not from the tools they use.

The AI isn't creating the fresh ideas. People are.

runarberg 561 days ago

So what you are saying is some aspects of movie making will use AI as parts of their jobs. That is very realistic and probably already happening.

Saying that large video models will be in theaters sounds like a completely different and much more ambitious prediction. I interpreted it as saying that large video models will produce whole movies on their own from a script of prompts. That there will be a single film maker with only a large video model and some prompts to make the movie. Such films will never be in the theater, unless by some grifter, and then it is certain to be a flop.

troupo 562 days ago

You should watch how movies are made sometime. How a script is developed. How changes to it are made. How storyboards are created. How actors are screened for roles. How locations are scouted, booked, and changed. How the gazillion of different departments end up affecting how a movie looks, is produced, made, and in which direction it goes (the wardrobe alone, and its availability and deadlines will have a huge impact on the movie).

What does "EXT. NIGHT" mean in a script? Is it cloudy? Rainy? Well lit? What are camera locations? Is the scene important for the context of the movie? What are characters wearing? What are they looking at?

What do actors actually do? How do they actually behave?

Here are a few examples of script vs. screen.

Here's a well described script of Whiplash. Tell me the one hundred million things happening on screen that are not in the script: https://www.youtube.com/watch?v=kunUvYIJtHM

Or here's the Joker interrogation from The Dark Knight. Same million different things, including actors (or the director) ignoring instructions in the script: https://www.youtube.com/watch?v=rqQdEh0hUsc

Here's A Few Good Men: https://www.youtube.com/watch?v=6hv7U7XhDdI&list=PLxtbRuSKCC...

and so on

---

Edit. Here's Annie Atkins on visual design in movies, including Grand Budapest Hotel: https://www.youtube.com/watch?v=SzGvEYSzHf4. And here's a small article summarizing some of it: https://www.itsnicethat.com/articles/annie-atkins-grand-buda...

Good luck finding any of these details in any of the scripts. See minute 14:16 where she goes through the script

Edit 2: do watch The Kerning chapter at 22:35 to see what it actually takes to create something :)

shermantanktop 562 days ago

I can't upvote this enough. This topic in the media space has generated a huge amount of naive speculation that amounts to "how hard could it be to do <thing i know nothing about>?"

FranzFerdiNaN 562 days ago

> "how hard could it be to do <thing i know nothing about>?"

This is most Hacker News comments summarized lmao. It's kinda my favorite thing of this place: just open any thread and you immediately see so many people rushing to say ''well just do X or Y'' or ''actually it's X or Y and not Z like the experts claim''. Love it.

shermantanktop 562 days ago

In this case, it’s movies and TV, which most people enjoy. So there’s a superficial accessibility to the problem which encourages this attitude.

Of course, HN being the place that it is, the same type of comments are made about quantum entanglement and solar panel efficiency.

bunabhucan 562 days ago

I agree with you.

At the same time I am curious, in the "that person has too many fingers" sense, about what a system trained on tens of thousands of movies plus scripts plus subtitles plus metadata etc. would generate.

I thought about it for a bit and I would want to watch a computer generated Sharknado 7 or Hallmark Christmas movie.

robotresearcher 562 days ago

Of course normally other people contribute to a movie after the writer. My comment mentioned three of the important roles. This whole thread is about tech that automates away those roles. That's the whole point.

dbspin 562 days ago

I think you've misunderstood the objection.

Let's pick something concrete. It's a medieval script, it opens with two knights fighting. OK so later in the script we learn their characters, historic counterparts etc. So your LLM can match nefarious villain to some kind of embedding, and doubtless has trained on countless images of a knight.

But the result is not naively going to understand the level of reality the script is going for - how closely to stick to historic parallels, how much to go fantastical with the depiction. The way we light and shoot the fight and how it coheres with the themes of the scene, the way we're supposed to understand the characters in the context of the scene and the overall story, the references the scene may be making to the genre or even specific other films etc.

This is just barely scraping the surface of the beginnings of thinking about mise en scene, blocking, framing etc. You can't skip these parts - and they're just as much of a challenge as temporal coherence, or performance generation or any of the other hard 'technical issues' that these models have shown no capacity to solve. They're decisions that have to be made to make a film coherent at all - not yet good or tasteful or creative or whatever.

Put another way - you'd need AGI to comprehend a script at the level of depth required to do the job of any HOD on any film. Such a thing is doubtless possible, but it's not going to be shortcut naively the way generating an image is - because it requires understanding in context, precisely what LLMs lack.

robotresearcher 562 days ago

> but the result is not naively going to understand the level of reality the script is going for…

We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner Bros. cartoon, or hyperrealistic works today. So do lighting instructions, color palettes, on and on.

These future models will not be large language models, they will be multi-modal. Large movie models if you like. They will have tons of context about how scenes within movies cohere, just as LLMs do within documents today.

troupo 562 days ago

So, we went from "just hand off the movie script to an automated director/DP/editor" and we're now rapidly approaching:

- you have to provide correct detailed instructions on lighting

- you have to provide correct detailed instructions on props

- you have to provide correct detailed instructions on clothing

- you have to provide correct detailed instructions on camera position and movement

- you have to provide correct detailed instructions on blocking

- you have to provide correct detailed instructions on editing

- you have to provide correct detailed instructions on music

- you have to provide correct detailed instructions on sound effects

- you have to provide correct detailed instructions on...

- ...

- repeat that for literally every single scene in the movie (up to 200 in extreme cases)

There's a reason I provided a few links for you to look at. I highly recommend the talk by Annie Atkins. Watch it, then open any movie script, and try to find any of the things she is talking about there (you can find actual movie scripts here: https://imsdb.com)

krainboltgreene 562 days ago

This is such an incredibly confident comment. I'm in awe.

letmevoteplease 562 days ago

Shane Carruth (Primer) released interesting scripts for "A Topiary" and "The Modern Ocean" which now have no hope of being filmed. I hope AI can bring them to life someday. If we get tools like ControlNet for video, maybe Carruth could even "direct" them himself.

spoaceman7777 562 days ago

This exists already actually. Kling AI 1.5. Saw the demo on twitter two days ago, which shows a photo-to-video transformation on an image of three women standing on a beach, and the video transformation simulates the camera rotating, with the women moving naturally. Just involves a segment-anything style selection of the women, and drawing a basic movement vector.

https://x.com/minchoi/status/1862975323433795726

Der_Einzige 562 days ago

ControlNet for video is just ControlNet run frame by frame, resulting in AI rotoscoping.

bwfan123 562 days ago

brilliant take from Ben Affleck on ai in movies..

"movies will be one of the last things to be replaced by ai"

https://www.youtube.com/watch?v=ypURoMU3P3U

including this quote: "being a craftsman is knowing how to work, art is knowing when to stop"

rossjudson 562 days ago

It is absolutely true that LLMs do not know when to stop.

natmaka 562 days ago

An adequate prompter (human at the prompt) knows when to stop.

jerf 562 days ago

That's what I describe at the end, albeit quickly in lingo, where the internal coherence is maintained in internal embeddings that are never related to English at all. A top-level AI could orchestrate component AIs through embedded vectors, but you'll never do it with a human trying to type out descriptions.

minimaxir 562 days ago

> Under the hood, with the way the text is turned into vector embeddings, it's fairly questionable whether you'd agree that it can even represent such a thing.

The text encoder may not be able to know complex relationships, but the generative image/video models that are conditioned on said text embeddings absolutely can.

Flux, for example, uses the very old T5 model for text encoding, but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt: https://x.com/minimaxir/status/1820512770351411268

dragonwriter 562 days ago

> but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt

Flux certainly does not consistently do so across an arbitrary collection of multi-paragraph prompts, as anyone who's run more than a few long prompts past it would recognize; also, the tweet is wrong in the other direction as well: longer language-model-preprocessed prompts for models that use CLIP (like various SD1.5 and SDXL derivatives) are, in fact, a common and useful technique. (You’d kind of think that the fact that the generated prompt here is significantly longer than the 256 token window of T5 would be a clue that the 77 token limit of CLIP might not be as big of a constraint as the tweet was selling it as, too.)
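For a feel of what those window sizes mean for a long prompt, here is a toy sketch of plain truncation. The 77 and 256 token figures are the ones cited above. The whitespace tokenizer is a stand-in (real text encoders use subword tokenizers), and how a given pipeline handles overflow (truncation, chunking, or something else) is not specified here.

```python
CLIP_WINDOW = 77   # CLIP token limit cited above
T5_WINDOW = 256    # T5 token window cited above


def toy_tokenize(text: str) -> list[str]:
    """Whitespace split as a stand-in for a real subword tokenizer."""
    return text.split()


def truncate(tokens: list[str], window: int) -> list[str]:
    """Keep only the tokens that fit in the encoder's window."""
    return tokens[:window]


def kept_fraction(text: str, window: int) -> float:
    """Fraction of the prompt's tokens that survive truncation."""
    tokens = toy_tokenize(text)
    if not tokens:
        return 1.0
    return len(truncate(tokens, window)) / len(tokens)


# A synthetic 400-token "multi-paragraph" prompt.
long_prompt = " ".join(f"word{i}" for i in range(400))

print(f"CLIP-sized window keeps {kept_fraction(long_prompt, CLIP_WINDOW):.0%}")
print(f"T5-sized window keeps {kept_fraction(long_prompt, T5_WINDOW):.0%}")
```

With plain truncation, anything past the window never reaches the image model, so a generated prompt longer than the window cannot be fully "adhered to" no matter how capable the downstream model is.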

lmm 562 days ago

> You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.

How would you ever tweak or debug it in that case? It doesn't strictly have to be English, but some kind of human-readable representation of the intermediate stages will be vital.

amelius 562 days ago

Can't you just give it a photo of a dog, and then say "use this dog in this or that scene"?

artemisart 562 days ago

Yes, the idea works and was explored with dreambooth/textual inversion for image diffusion models.

https://dreambooth.github.io/ https://textual-inversion.github.io/

minimaxir 562 days ago

Both of those are of course out of date and require significant training instead of just feeding it a single image.

InstantID (https://replicate.com/zsxkib/instant-id) fixes that issue.

Auracle 561 days ago

Dreambooth style training is in no way out of date.

If you just want a face, InstantID/PuLID work - but it’s not going to be very varied. Doing actual training means you can get any perspective, lighting, style, expression, etc - and have the whole body be accurate.

alpha_squared 562 days ago

How would that even work? A dog has physical features (legs, nose, eyes, ears, etc.) that they use to interact with the world around them (ground, tree, grass, sounds, etc.). And each one of those things has physical structures that compose senses (nervous system, optic nerves, etc.). There are layers upon layers of intricate complexity that took eons to develop and a single photo cannot encapsulate that level of complexity and density of information. Even a 3D scan can't capture that level of information. There is an implicit understanding of the physical world that helps us make sense of images. For example, a dog with all four paws standing on grass is within the bounds of possibility; a dog with six paws, two of which are on its head, is outside the bounds of possibility. An image generator doesn't understand that obvious delineation and just approximates likelihood.

int_19h 562 days ago

A single photo doesn't have to capture all that complexity. It's carried by all those countless dog photos and videos in the training set of the model.

krainboltgreene 562 days ago

Actually, it does have to capture all of that complexity because it's a photon-based analysis of reality. You cannot take a photo without doing that.

fennecbutt 554 days ago

This is correct and even image generation models aren't really trained for comprehension of image composition yet.

Even the models based off danbooru and E621 still aren't the best at that. And us furries like to tag art in detail.

The best we can really do at the moment is regional prompting, perhaps they need something similar for video.

echelon 562 days ago

For those not in this space, Sora is essentially dead on arrival.

Sora performs worse than closed source Kling and Hailuo, but more importantly, it's already trumped by open source too.

Tencent is releasing a fully open source Hunyuan model [1] that is better than all of the SOTA closed source models. Lightricks has their open source LTX model and Genmo is pushing Mochi as open source. Black Forest Labs is working on video too.

Sora will fall into the same pit that Dall-E did. SaaS doesn't work for artists, and open source always trumps closed source models.

Artists want to fine tune their models, add them to ComfyUI workflows, and use ControlNets to precision control the outputs.

Images are now almost 100% Flux and Stable Diffusion, and video will soon be 100% Hunyuan and LTX.

Sora doesn't have much market apart from name recognition at this point. It's just another inflexible closed source model like Runway or Pika. Open source has caught up with state of the art and is pushing past it.

[1] https://github.com/Tencent/HunyuanVideo

circlefavshape 562 days ago

Their online version is all in Chinese (or at least some Chinese-looking script I don't understand) ... and they recommend an 80GB GPU to run the thing, which costs ~€15-18k. Yikes, guess I won't be doing this at home anytime soon

yeknoda 562 days ago

Something like a white paper with a mood board, color scheme, and concept art as the input might work. This could be sent into an LLM "expander" that increases the words and specificity. Then multiple reviews to tap things in the right direction.

mikepurvis 562 days ago

I expect this kind of thing is actually how it's going to work longer term, where AI is a copilot to a human artist. The human artist does storyboarding, sketching in backdrops and character poses in keyframes, and then the AI steps in and "paints" the details over top of it, perhaps based on some pre-training about what the characters and settings are so that there's consistency throughout a given work.

The real trick is that the AI needs to be able to participate in iteration cycles, where the human can say "okay this is all mostly good, but I've circled some areas that don't look quite right and described what needs to be different about them." As far as I've played with it, current AIs aren't very good at revisiting their own work— you're basically just tweaking the original inputs and otherwise starting over from scratch each time.

programd 562 days ago

We will shortly have much better tweaking tools which work not only on images and video but concepts like what aspects a character should exhibit. See for example the presentation from Shapeshift Labs.

https://www.shapeshift.ink/

3form 562 days ago

And I think this realistically is going to be the shape of the tools to come in the foreseeable future.

echelon 562 days ago

You should see what people are building with Open Source video models like HunYuan [1] and ComfyUI + Control Nets. It blows Sora out of the water.

Check out the Banodoco Discord community [2]. These are the people pioneering steerable AI video, and it's all being built on top of open source.

[1] https://github.com/Tencent/HunyuanVideo

[2] https://banodoco.ai/

prmoustache 562 days ago

The whole point of AI stuff is not to produce exactly what you have in mind, but what you are describing. Same with text, code, images, video...

szundi 562 days ago

Sounds like we achieved 50% of AI then. The artificial is there, now we need the intelligence part.

baq 562 days ago

Sora should be evaluated on xkcd strips as inputs.