by tmjdev 1357 days ago
This looks like the video equivalent of Dall-E 1. Hard to believe how far we've come so quickly.

The paper talks about "pseudo 3D attention layers" that are used in place of temporal attention layers for each dimension due to memory consumption. It seems like AI research is vastly outpacing GPU development.
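For a rough sense of why memory pushes towards factorised attention, here is a back-of-the-envelope sketch. It assumes "pseudo 3D" means attending over space within each frame and then over time at each spatial position. The 16x16x16 token grid is a made-up size, not a figure from the paper, and the function names are hypothetical.

```python
# Counts pairwise attention scores only; ignores heads, batch size and channels.

def full_attention_pairs(frames: int, height: int, width: int) -> int:
    """Every token attends to every other token across space and time."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_attention_pairs(frames: int, height: int, width: int) -> int:
    """Spatial attention within each frame, then temporal attention per position."""
    spatial_tokens = height * width
    spatial = frames * spatial_tokens ** 2   # one spatial attention map per frame
    temporal = spatial_tokens * frames ** 2  # one temporal attention map per position
    return spatial + temporal


# Hypothetical token grid: 16 frames of 16x16 tokens (an assumption).
frames, height, width = 16, 16, 16
full = full_attention_pairs(frames, height, width)
factorized = factorized_attention_pairs(frames, height, width)
ratio = full / factorized

print(f"full 3D attention pairs:    {full:,}")
print(f"factorised attention pairs: {factorized:,}")
print(f"ratio: {ratio:.1f}x")
```

With these assumed sizes the factorised version needs roughly 15x fewer attention scores, and the gap widens as the frame count or resolution grows.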

4 comments

Indeed - it's not hard from a research point of view - it's hard from a compute perspective because adding one more dimension requires hundreds of times more compute.

Even then, these videos are only like 50 frames long - and you would want a real movie to be hundreds of thousands of frames long.

So you need to optimise on a compressed version, not the whole thing. What they’re doing right now is akin to a human trying to hold an entire picture - or entire movie - in their head all at once.

We can’t do it. AIs can sort of do it.

Latent diffusion models already demonstrated that operating on a compressed representation gives far better results, faster, but I don’t think we’re anywhere near the limit for what’s possible there. It’s no coincidence that this is how humans work.
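To put a number on the compression point, here is a small sketch using the commonly cited Stable Diffusion shapes as an assumed example: a 512x512 RGB image encoded to a 64x64 latent with 4 channels (8x downsampling per side). The helper names are hypothetical.

```python
# Compares how many values a diffusion model handles in pixel space
# versus in an assumed latent space.

def element_count(height: int, width: int, channels: int) -> int:
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height: int, width: int, downsample: int, latent_channels: int):
    """Shape of the latent after downsampling each spatial side by `downsample`."""
    return height // downsample, width // downsample, latent_channels


# Assumed shapes: 512x512x3 image, 8x downsampling, 4 latent channels.
pixel_values = element_count(512, 512, 3)
lat_h, lat_w, lat_c = latent_shape(512, 512, downsample=8, latent_channels=4)
latent_values = element_count(lat_h, lat_w, lat_c)

reduction = pixel_values / latent_values
position_reduction = (512 * 512) / (lat_h * lat_w)

print(f"pixel-space values:  {pixel_values:,}")
print(f"latent-space values: {latent_values:,}")
print(f"values reduced by {reduction:.0f}x, spatial positions by {position_reduction:.0f}x")
```

Under these assumptions the model works on 48x fewer values and 64x fewer spatial positions, which matters a lot once attention cost grows with the square of the number of positions.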

I am curious if there has been any research on temporal attention in humans. I'm not sure how you'd quantify it. But in myself I know that I'm constantly predicting where something will be or what it will look like based on how it did a second ago. It's probably the root of reflexes.

Your comment reminded me of this video [1]

They put an eye tracker on someone and captured their motion when walking in some rough terrain. You can sort of see that the person is focusing on the most likely place their foot will go next.

[1] https://www.youtube.com/watch?v=ph6uUHq3a-g

I think that we will discover that there is a more efficient way to encode temporal relationships than the current approach, which appears to be "just throw transformers at it." My guess is that this attention will be applied in a more conceptual latent space.

>and a real movie you would want to be hundreds of thousands of frames long.

Yes, but consider that most films are made up of many different shots, each of which is often just seconds long.

True, but the attention layers still need to be able to look at all the shots - for example to make sure the background of a room shown at the start of the movie is the same as the background of the same room at the end.

Obviously you could do 'human assisted' movie making where humans decide the storyboard and make directions for each shot, and then that isn't necessary.

Hardware has probably always lagged behind cutting-edge research. Just consider video games: they have pushed hardware limits very hard ever since Pong.

To be fair, that's a good thing: forcing research teams to optimize their projects is beneficial and creates competition for limited resources. This gets a bit skewed when we consider a university research team vs. a MANGA-type company, but the team behind Stable Diffusion proved that innovation can come from unexpected places.

Looks a bit better than DALL-E 1, IMHO. They've demonstrated greater range.

I wonder how much VRAM these models need?