by londons_explore 1357 days ago
Indeed - it's not hard from a research point of view - it's hard from a compute perspective because adding one more dimension requires hundreds of times more compute.

Even then, these videos are only about 50 frames long, whereas a real movie would need to be hundreds of thousands of frames long.
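As a rough back-of-envelope illustration of that gap, here is a minimal Python sketch. It assumes full self-attention over every space-time token, so cost grows with the square of the token count. The 256 tokens per frame and the 200,000-frame movie length are made-up illustrative figures, not numbers from any particular model.

```python
def attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Count query-key pairs for full self-attention over all space-time tokens."""
    total_tokens = frames * tokens_per_frame
    return total_tokens ** 2


TOKENS_PER_FRAME = 256   # hypothetical: e.g. a 16x16 grid of latent patches per frame
CLIP_FRAMES = 50         # roughly the clip length mentioned above
MOVIE_FRAMES = 200_000   # hypothetical: about 2.3 hours at 24 fps

clip_cost = attention_pairs(CLIP_FRAMES, TOKENS_PER_FRAME)
movie_cost = attention_pairs(MOVIE_FRAMES, TOKENS_PER_FRAME)
cost_ratio = movie_cost // clip_cost

print(f"clip pairs:  {clip_cost:,}")
print(f"movie pairs: {movie_cost:,}")
print(f"movie / clip: {cost_ratio:,}x")
```

Under these assumptions, 4,000 times more frames means 16 million times more attention pairs, which is why naively scaling up the clip length doesn't work.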


So you need to optimise on a compressed version, not the whole thing. What they’re doing right now is akin to a human trying to hold an entire picture - or entire movie - in their head all at once.

We can’t do it. AIs can sort of do it.

Latent diffusion models already demonstrated that operating on a compressed representation gives far better results, faster, but I don’t think we’re anywhere near the limit for what’s possible there. It’s no coincidence that this is how humans work.
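To put a rough number on how much smaller a compressed representation can be, here is a small Python sketch. The 8x spatial downsampling and 4 latent channels are assumptions chosen to resemble commonly used latent diffusion autoencoders. The helper names are hypothetical and not taken from any library.

```python
def element_count(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height: int, width: int, downsample: int, latent_channels: int):
    """Shape of the latent, assuming each spatial side shrinks by `downsample`."""
    return (height // downsample, width // downsample, latent_channels)


# Assumed setup: a 512x512 RGB image, 8x downsampling per side, 4 latent channels.
pixel_elements = element_count(512, 512, 3)
latent = latent_shape(512, 512, downsample=8, latent_channels=4)
latent_elements = element_count(*latent)
compression = pixel_elements / latent_elements

print(f"pixels: {pixel_elements:,} values")
print(f"latent: {latent} = {latent_elements:,} values")
print(f"compression: {compression:.0f}x")
```

With these assumed settings the model works on 48 times fewer values per image. Any further compression along the time axis would multiply on top of that.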

I am curious if there has been any research on temporal attention in humans. I'm not sure how you'd quantify it. But in myself I know that I'm constantly predicting where something will be or what it will look like based on how it did a second ago. It's probably the root of reflexes.

Your comment reminded me of this video [1]

They put an eye tracker on someone and captured their motion when walking in some rough terrain. You can sort of see that the person is focusing on the most likely place their foot will go next.

[1] https://www.youtube.com/watch?v=ph6uUHq3a-g

I think we will discover that there is a more efficient way to encode temporal relationships than the current approach, which appears to be "just throw transformers at it." My guess is that this attention will be applied in a more conceptual latent space.

>and a real movie you would want to be hundreds of thousands of frames long.

Yes, but consider that most films are made up of many different shots, each of which is often just seconds long.

True, but the attention layers still need to be able to look at all the shots - for example to make sure the background of a room shown at the start of the movie is the same as the background of the same room at the end.

Obviously you could do 'human assisted' movie making where humans decide the storyboard and give directions for each shot, and then attention across the whole movie isn't necessary.
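The trade-off between per-shot and whole-movie attention can be sketched with some simple arithmetic in Python. All the numbers here are hypothetical (1,500 shots of 120 frames each, 256 tokens per frame), and the sketch assumes plain quadratic self-attention with no sparsity or other tricks.

```python
def full_attention_pairs(shots: int, frames_per_shot: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token can attend to every token in the movie."""
    total_tokens = shots * frames_per_shot * tokens_per_frame
    return total_tokens ** 2


def per_shot_attention_pairs(shots: int, frames_per_shot: int, tokens_per_frame: int) -> int:
    """Query-key pairs when tokens only attend within their own shot."""
    tokens_per_shot = frames_per_shot * tokens_per_frame
    return shots * tokens_per_shot ** 2


SHOTS = 1_500            # hypothetical shot count for a feature film
FRAMES_PER_SHOT = 120    # hypothetical: 5 seconds at 24 fps
TOKENS_PER_FRAME = 256   # hypothetical

full = full_attention_pairs(SHOTS, FRAMES_PER_SHOT, TOKENS_PER_FRAME)
local = per_shot_attention_pairs(SHOTS, FRAMES_PER_SHOT, TOKENS_PER_FRAME)
savings = full // local

print(f"whole-movie attention: {full:,} pairs")
print(f"per-shot attention:    {local:,} pairs")
print(f"savings: {savings:,}x")
```

With equal-length shots, restricting attention to each shot cuts the cost by a factor equal to the number of shots. The price is that nothing in the model ties the room in the first shot to the same room in the last one, so a storyboard or something similar has to carry that consistency.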