First of all, the title should include "video of a predefined 360° turn".
And then they say something along the lines of "average accuracy of about 5mm" for joining the constructed modeled joints to their model, while you see the body wobbling around happily.
99%¹ of computer vision problems are 80% solved. The problem is, you need a 95+% solution to be practically useful.
Binocular stereo vision has just approached general applicability, and SfM is mostly used in very constrained environments (traffic analysis) or with large computational resources with manual correction (offline 3D mapping from aerial data).
¹ Numbers are metaphoric only, based on experience in scientific and industrial CV.
SfM does not automatically provide joint locations. Also, a casual 360° video around a subject does not provide enough data to produce a full body mesh.
How is this ML? They use a CNN for foreground segmentation, a minor step in their pipeline. But the major contribution seems to be putting the silhouettes in a common reference frame. I sincerely hope sciencemag isn’t putting ML in the title purely to jump on the bandwagon.
To be fair, they do have examples that aren’t chroma keyed; they just lead with one that is.
Which is not to say that ML is necessary for this sort of computer vision task, but I wonder if it yields better or sharper results than other techniques?
Same. As someone who has spent an embarrassing amount of time keying and tracking video footage over the years, I’m surprised ML isn’t being used for this more often in studios by now.
They say “standard” video is the source, so it would likely be on the order of 30 or 60 fps. Seems to be around a couple hundred frames, give or take, though I suspect it could get _something_ out of fewer frames, and more would just incrementally improve the model.
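For a rough sense of scale, here is the back-of-the-envelope arithmetic. Taking "a couple hundred" as exactly 200 frames is my own assumption, and `clip_seconds` is just an illustrative helper, not anything from the article:

```python
def clip_seconds(frame_count, fps):
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return frame_count / fps

# Assumed frame count of 200; the two frame rates are the ones guessed above.
for fps in (30, 60):
    print(f"200 frames at {fps} fps is about {clip_seconds(200, fps):.1f} s")
```

So the whole turn would fit in a clip of only a few seconds at either frame rate.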
I would expect minor textural differences in a hand-drawn or painted source would make it a lot harder to correlate points between frames, but it’s an interesting idea to think about!
In the case of Face ID, at least, you’d still have to transfer the measurements into the physical world, in a way that fools a system that has ostensibly been designed not to be fooled by masks.
I wonder if we will soon see a future where a director can fully edit the positions and physical actions of the actors in post-production.
Basically, whole scenes will be transferred to believable 3D models seamlessly, and you can reanimate parts of everything. I feel like that's going to happen for sure, for big Hollywood productions at least (like the Marvel stuff).
This already happens a lot: most VFX-heavy productions will have digital doubles of the main cast, and they can be used for a reason as simple as reframing a shot.
Your comment could give the impression that this is drastically simpler to do than it is in reality. It is considered something like the last frontier of VFX, and a lot of work remains to be done.
While you’re essentially correct, it is currently an overwhelmingly manual process. The amount of work and time necessary is substantial (some would say outrageous), and exponentially higher for certain types of shots. Many shots remain impossible or cost-prohibitive.
I'm going to guess they start with a generic human model that includes all limbs and extremities and then the "machine learning" process attempts to fit that model to the silhouettes extracted from the video.
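To make that guess concrete, here is a toy sketch of fitting a template to a silhouette. Everything in it is my own stand-in, not the paper's method: the "generic model" is just an axis-aligned ellipse, the silhouette is synthetic, and the fit is a greedy search that maximizes overlap (IoU). A real pipeline would use an articulated body model and many views.

```python
import numpy as np

def render_ellipse(shape, cx, cy, rx, ry):
    """Rasterize a filled axis-aligned ellipse as a boolean mask."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return ((xs - cx) / rx) ** 2 + ((ys - cy) / ry) ** 2 <= 1.0

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def fit_template(silhouette, start, steps=(8.0, 4.0, 2.0, 1.0)):
    """Greedy coordinate search: nudge one parameter at a time and
    keep any move that raises the IoU with the silhouette."""
    params = list(start)
    best = iou(render_ellipse(silhouette.shape, *params), silhouette)
    for step in steps:  # coarse-to-fine step sizes
        improved = True
        while improved:
            improved = False
            for i in range(len(params)):
                for delta in (-step, step):
                    trial = list(params)
                    trial[i] += delta
                    if trial[2] < 1 or trial[3] < 1:
                        continue  # skip degenerate radii
                    score = iou(render_ellipse(silhouette.shape, *trial), silhouette)
                    if score > best:
                        params, best, improved = trial, score, True
    return params, best

# Synthetic "silhouette" (hypothetical data) and a generic starting template.
target = render_ellipse((64, 64), 40, 30, 10, 20)
start = (32.0, 32.0, 12.0, 12.0)
start_score = iou(render_ellipse(target.shape, *start), target)
params, score = fit_template(target, start)
print("start IoU:", round(float(start_score), 3), "fitted IoU:", round(float(score), 3))
print("fitted (cx, cy, rx, ry):", params)
```

The point is only the shape of the loop: render the model, compare against the extracted silhouette, adjust the parameters, repeat. Whether the "machine learning" part does anything like this is still a guess on my part.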