by pauloday 591 days ago

Someone wrote the following comment then deleted it. I spent 30 minutes on my response and wanted to post it anyway. Apologies if the original comment was deleted by a mod, I hope this is OK to post.

---QUOTE---

My "test" for video generation turning movie making on its head is when a model can add the missing Tom Bombadil chapters to Peter Jackson's LOTR movies.

Probably 20 - 30 minutes of HD, aesthetically synced, scripted etc with minimal editing after a detailed prompt and source material.

Qualifier - the AI just has to follow the book script, third party tools ok to use for lip syncing and audio :)

I said 5 years away last year.

Feels like it might be more like 1 - 2 years.

What do you think?

---END QUOTE---

My response:

I think we're getting into diminishing returns territory with this AI stuff. These video/image generators are impressive but they don't "understand" physical reality and probably never will without a breakthrough. You can see this in the demo videos: the best-looking ones are glorified still images, and the worst are the ones where something physical happens, like the lemon being picked up or the guy eating cereal. These examples may get better, but I really doubt they'll ever look like real unaltered camera footage without somehow adding an understanding of how our physical reality works into the model.

For the script generation, The Fellowship of the Ring is not a movie script and requires serious interpretation and planning to be converted into one, especially if you want it to fit into Jackson's films at all. If nothing else, the dialog and frequency of songs/poetry are very different. The current text generators aren't really capable of that kind of planning yet, but I wouldn't be surprised if there's a screenwritten treatment of that chapter floating around on the internet somewhere, or at least bits of one. These models have certainly ingested The Fellowship of the Ring, and plenty of screenplays plus the books they were based on. So maybe ChatGPT can make a convincing script. I asked the free version and got some dialog that seems fine, but absolutely no scene direction at all. I'm willing to believe that was either an issue with my prompting or something that can be fixed in 5 years. So at least the script may be possible.

As for converting it into an actual piece of film, I don't think that's currently possible without a breakthrough on planning. There's a reason these video demos aren't usually very long: they aren't good at scene changes. People's faces change, rooms change shape, etc. Maybe that can be fixed through engineering, but film editing is hard. It's not easy to plan and chain together shots in a way that gives a proper sense of physical reality while conveying everything a scene needs to.

Take a look at Dan Olson's video analyzing the editing of Suicide Squad[1]. That movie was edited by a trailer house and it shows. A big issue is that the scenes and shots don't flow together very well - it's edited like a bunch of separate shots and scenes rather than a coherent whole. As a result it's generally considered one of the worst big-budget films ever made. And from my (admittedly limited) understanding of and playing around with these generators, they aren't even remotely close to being able to do the type of planning needed to pull that kind of editing off, much less something on the level of Jackson's adaptation. Again, I could be wrong, but it really seems like it would take another "Attention Is All You Need" level breakthrough to get there.

So I'd say no, I don't think we'll get what you describe, at least not at any level of quality, in 1-2 years. 5 years sounds more realistic but I really believe we'd need another huge breakthrough to get there, and those are hard to come by. Assuming one will happen in any given time period seems foolish. But a lot of smart people are working on that, so maybe we'll get it. But I don't think we'll even get there in 10 years with just engineering improvements on the current stuff. Scientific progress isn't linear.

Your prediction, like a lot of others about AI stuff, really reminds me of how all the futurists in the 1950s thought we'd be able to freeze and unfreeze humans in a few short years. They thought that because it's actually really easy to do that with hamsters, but it turns out scaling the process up isn't so easy (Tom Scott has a good video tangentially related to this[2]). I think a lot of people are standing near the top of the steep part of a sigmoid curve and saying "Wow look how far we've come in just 3 years! The next 3 years are going to be insane!" when in reality we just have a long plateau of minor improvements in front of us. But who knows, maybe that next breakthrough is right around the corner.

[1]: https://www.youtube.com/watch?v=mDclQowcE9I

[2]: https://www.youtube.com/watch?v=2tdiKTSdE9Y

3 comments

thrdbndndn 591 days ago

> getting into diminishing returns

The same applies to image generation. Asking AI to create something based on a rough idea is straightforward and can yield amazing results at times. However, fine-tuning the details of an image is incredibly challenging without manual intervention—especially for aspects that are intuitive to humans but lack sufficient representation in the AI's training data.

Honestly, I'd say even text generation, whether it's coding or copywriting—arguably what generative AI excels at—often hits this same limitation.

nopinsight 591 days ago

Have you seen Veo 2 just launched by Google? Its quality and physics understanding appear far ahead of the competition.

https://deepmind.google/technologies/veo/veo-2/

Also, planning might be around the corner with test-time compute applied to video generation.

wcarss 590 days ago

There's also Genie 2:

https://deepmind.google/discover/blog/genie-2-a-large-scale-...

this one's entirely about world understanding with physical concepts etc. and less about photorealism, but it's really not hard to imagine a pipeline combining these

magic_hamster 590 days ago

> These video/image generators are impressive but they don't "understand" physical reality and probably never will without a breakthrough

It turns out some generative models are good enough at simulating physics that they can replace actual simulators for a fraction of the cost. Can't find the link right now, but the excellent "Two Minute Papers" channel has covered quite a few examples of this. In particular I remember a weather or cloud simulation that was replicated with gen AI.
