https://twitter.com/BenMildenhall/status/1758224827788468722
https://twitter.com/ScottieFoxTTV/status/1758272455603327455
The key is that the video is spatially consistent. Once you have that, existing tech can take the output and infer actual spatial forms.
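To make "spatially consistent" a bit more concrete, here's a rough sketch of the simplest version of the idea: two-view midpoint triangulation. This is not what the linked tweets use; it's just the textbook building block. The camera positions and ray directions below are made-up values for illustration. If two frames agree about where a point is, the viewing rays nearly intersect and you get a 3D position; if they don't, the rays miss each other and the gap tells you so.

```python
import math

def closest_point_between_rays(c1, d1, c2, d2):
    """Midpoint triangulation: return (point, gap) for rays c1 + s*d1 and c2 + t*d2."""
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    w = [x - y for x, y in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    # Parameters of the closest points on each ray (least-squares solution).
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    point = [(u + v) / 2 for u, v in zip(p1, p2)]
    gap = math.dist(p1, p2)  # how far the two rays miss each other
    return point, gap

# Hypothetical setup: two camera centers one unit apart, both looking at the same point.
cam1, cam2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
ray1 = [0.5, 0.2, 4.0]    # direction from cam1 toward the point
ray2 = [-0.5, 0.2, 4.0]   # direction from cam2 toward the same point

# Consistent views: the rays meet, so the gap is (numerically) zero.
point_ok, gap_ok = closest_point_between_rays(cam1, ray1, cam2, ray2)

# Inconsistent views: the second frame "moved" the point, so the rays no longer meet.
ray2_bad = [-0.5, 0.5, 4.0]
point_bad, gap_bad = closest_point_between_rays(cam1, ray1, cam2, ray2_bad)
```

Real pipelines do this over many frames and many points at once, and also have to estimate the camera poses, but the dependence on consistency is the same: a small `gap` across views means the geometry is recoverable.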