Hacker News
by Filligree 1364 days ago
So you need to optimise on a compressed version, not the whole thing. What they’re doing right now is akin to a human trying to hold an entire picture - or entire movie - in their head all at once.

We can’t do it. AIs can sort of do it.

Latent diffusion models already demonstrated that operating on a compressed representation gives far better results, faster, but I don’t think we’re anywhere near the limit for what’s possible there. It’s no coincidence that this is how humans work.
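To put rough numbers on the "compressed representation" point, here is a back-of-the-envelope sketch. The shapes are an assumption: a 512x512 RGB image and a 64x64x4 latent, which are the sizes commonly quoted for Stable Diffusion. The helper names are made up for illustration, and real models don't attend over every pixel pair, so treat the attention figure as a scale comparison only.

```python
# Rough size comparison: pixel space vs. a compressed latent space.
# Shapes are assumptions (commonly quoted Stable Diffusion sizes),
# not taken from any specific model discussed in this thread.

def num_values(height, width, channels):
    """Total number of scalar values in a height x width x channels tensor."""
    return height * width * channels

def attention_pairs(height, width):
    """Pairwise interactions if every spatial position attended to every other."""
    tokens = height * width
    return tokens * tokens

pixel_shape = (512, 512, 3)   # assumed RGB image
latent_shape = (64, 64, 4)    # assumed 8x spatial downsampling, 4 latent channels

# How many times fewer values the model has to denoise in latent space.
compression = num_values(*pixel_shape) / num_values(*latent_shape)

# How many times fewer pairwise interactions full self-attention would need.
attn_ratio = attention_pairs(512, 512) / attention_pairs(64, 64)

print(f"values to denoise: {compression:.0f}x fewer in latent space")
print(f"full-attention pairs: {attn_ratio:.0f}x fewer in latent space")
```

Under those assumed shapes, the latent has 48x fewer values, and naive all-pairs attention would be 4096x cheaper, which is why a better compressed space could matter even more for video.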

1 comments

I am curious if there has been any research on temporal attention in humans. I'm not sure how you'd quantify it. But in myself I know that I'm constantly predicting where something will be or what it will look like based on where it was or how it looked a second ago. It's probably the root of reflexes.
Your comment reminded me of this video [1].

They put an eye tracker on someone and captured their motion when walking in some rough terrain. You can sort of see that the person is focusing on the most likely place their foot will go next.

[1] https://www.youtube.com/watch?v=ph6uUHq3a-g

I think that we will discover that there is a more efficient way to encode temporal relationships than the current approach, which appears to be "just throw transformers at it." My guess is that this attention will be applied in a more conceptual latent space.