Self-Supervised Learning for Videos (lightly.ai)
91 points by sauravmaheshkar 673 days ago
5 comments

my hypothesis:

using video captured in 3D is going to vastly improve the learning of representations, with the added benefit of depth perception, which lets humans/neural nets predict what objects' projections onto a 2D plane should look like as they move

say the videos are captured using a pair of identical cameras on a phone, a feature I have been waiting a while to see on flagship phones and mass adopted
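
for what it's worth, the reason a pair of identical cameras gives depth can be sketched with the standard pinhole stereo relation: for rectified cameras, depth = focal length * baseline / disparity. A rough sketch follows; the function name and the example numbers (focal length in pixels, lens spacing) are made-up assumptions for illustration, not specs of any real phone.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in meters of a point seen by two rectified, identical cameras.

    focal_px:     focal length expressed in pixels (assumed value)
    baseline_m:   distance between the two camera centers in meters (assumed value)
    disparity_px: horizontal shift of the point between the two images, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# Hypothetical numbers: 1000 px focal length, lenses 1 cm apart.
near = depth_from_disparity(1000.0, 0.01, 10.0)  # larger shift between views: closer point
far = depth_from_disparity(1000.0, 0.01, 5.0)    # smaller shift between views: farther point
```

note that with a small baseline like a phone's, disparity shrinks quickly with distance, so the depth signal would mostly be useful for nearby objects.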

such mass adoption would ensure there are vast amounts of training data from all kinds of situations to learn everything about the visual world and its physics

now pair it with other sensors like audio, temperature, weather, chemicals, etc.

the model can learn to associate a boom with a flying jet, rumble with dark rolling clouds, and petrichor with rain on hot sand

we can slowly start to model more and more of human experience in a single model as computing power grows

I think that is similar to what Yann LeCun outlined: https://bdtechtalks.com/2022/03/07/yann-lecun-ai-self-superv...
Rather than doing self-supervised learning on the actual video frames, why not do it on the byte sequence that represents the video file?
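
To make the question concrete, here is a minimal sketch of how the raw bytes of a file could be turned into self-supervised training pairs via next-byte prediction. The function name and the tiny sample are hypothetical, and this only shows the data preparation step, not a model or any particular paper's method.

```python
from typing import Iterator, Tuple


def next_byte_pairs(data: bytes, context_len: int) -> Iterator[Tuple[bytes, int]]:
    """Yield (context, target) pairs: context_len bytes, then the byte that follows."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    # Slide a window over the byte stream; the label comes from the data itself.
    for i in range(len(data) - context_len):
        yield data[i:i + context_len], data[i + context_len]


# Stand-in for the contents of a video file; in practice this could be
# something like open(path, "rb").read() on a real file.
sample = bytes([0, 1, 2, 3, 4])
pairs = list(next_byte_pairs(sample, context_len=3))
```

One caveat with this framing: in a compressed file a single flipped byte can change how everything after it decodes, so the model would have to learn the codec implicitly before it learns anything about the scene.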
You might find this paper interesting: [JPEG-LM: LLMs as Image Generators with Canonical Codec Representations](https://arxiv.org/abs/2408.08459)
Thanks. This is exactly the kind of thing I was looking for.
Nice work!
Very cool!
Cool!