| In my day job I program rigid body behaviour in real time amongst other simulations.
I think rigid body contact is hard to learn because it is inherently discontinuous, something you discover when trying to code a solver. So I always use this prompt as a test:
"A video of a jenga brick tower falling over as a brick is removed. The physics of each brick must be realistic." It gave me a video where bricks suddenly disappear or morph into others [1]. The linked video is after 2-3 iterations of me insisting on realistic physics. If you were just glancing at it, you would believe it is realistic. That said, this is still very impressive and one more step towards... IDK what. But I am a bit reassured that at least my job won't be fully replaced with AI :) [1] https://streamable.com/2em1r3 |
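To make the discontinuity point in the quoted comment concrete, here is a minimal sketch of a 1D point mass dropped onto the ground. It is a toy and not how any production solver works: the function names, the restitution model, and all parameter values are assumptions made up for illustration. The point is only that velocity changes by a tiny amount per step under gravity, then jumps by orders of magnitude more in the single step where contact happens.

```python
def simulate_drop(height=1.0, restitution=0.5, dt=0.001, steps=2000, g=9.81):
    """Drop a point mass onto the ground at y = 0; return (y, v) after each step.

    Hypothetical toy model: semi-implicit Euler plus an instantaneous
    restitution impulse at contact. All defaults are illustrative assumptions.
    """
    y, v = height, 0.0
    history = []
    for _ in range(steps):
        v -= g * dt  # smooth part: gravity changes v by g*dt per step
        y += v * dt  # semi-implicit Euler position update
        if y < 0.0 and v < 0.0:
            # Contact: clamp to the ground and reverse the velocity.
            # This is the discontinuity; v jumps within a single step.
            y = 0.0
            v = -restitution * v
        history.append((y, v))
    return history


def largest_velocity_jump(history):
    """Largest change in velocity between two consecutive steps."""
    return max(abs(b[1] - a[1]) for a, b in zip(history, history[1:]))


if __name__ == "__main__":
    hist = simulate_drop()
    print("per-step change from gravity:", 9.81 * 0.001)
    print("largest per-step change seen:", largest_velocity_jump(hist))
```

A model that only ever sees smooth motion between frames has to infer that jump from pixels, and a stack of bricks has many such contacts switching on and off at once.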
I honestly can't say with certainty whether training from videos alone, with whatever tokenization scheme they're using, will ever get perfect dynamics.
However, it is worth noting that transformers can do a pretty good job of learning dynamics with the right pipeline (not video): https://arxiv.org/pdf/2605.15305 https://arxiv.org/pdf/2605.09196
My point is that, representationally, it might be possible to learn good dynamics without a radically different approach/arch. There are already models that extract 3D tracking points from videos, so they could possibly be leveraged for learning dynamics (which on its own gives some precedent for end-to-end approaches possibly working too).