Hacker News
by spankalee 2131 days ago
With so much multi-image computational photography and video processing these days, I've been wondering whether we could have a multiple camera system (with cameras on the top, bottom, left, and right of the screen) and a processor that can simulate a camera in the center of the screen - or even dynamically moved to the eyes of the caller.

I know there's a bunch of research on viewpoint interpolation, but how close might we be to a dedicated processor able to do this in a laptop, or at least in a specialized VC monitor?

6 comments

Apparently all current attempts resulted in very, very uncanny valleys. This thread mentions some current attempts (searching hn.algolia.com for 'gaze correction' will return additional threads).

https://news.ycombinator.com/item?id=24151123https://news.yc...

Even with multi-camera setups?
Seems possible. If the user is actually looking at the center of the screen then we only need to shift the view, not digitally move their eyes. That seems very doable with some GPU code.
> That seems very doable with some GPU code.

This seems about as hard as digitally moving eyes.

I think the main source of artifacts is going to be lighting and reflections. Specular color or reflections are only possible to see when light, surface position and normal, and observer are arranged in a specific way. If you have 2 or more cameras positioned elsewhere, there's no way to find out what color is visible to another camera in the center.

Modern AI can try to guess, but fundamentally that info isn't anywhere in the video. It can assume the object surface is made of a small number of uniform materials, and extrapolate those materials across the picture and across frames, but this is going to fail too often for biological subjects like people.
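
To make the view dependence concrete, here is a minimal Python sketch using the textbook Phong specular term. Everything about the geometry is an assumption for illustration: the screen lies in the plane z = 0, a surface point sits 0.5 m in front of it, and the lamp position, camera positions, and `shininess` value are made up. The surface normal is deliberately chosen so that the highlight points straight at the screen center.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def reflect(incoming, normal):
    # Mirror a direction about the unit normal: r = d - 2 (d . n) n
    k = 2.0 * dot(incoming, normal)
    return tuple(d - k * n for d, n in zip(incoming, normal))

def phong_specular(point, normal, light_pos, camera_pos, shininess=200.0):
    # Phong specular term: large only when the mirrored light direction
    # lines up with the direction toward the observer.
    to_light = normalize(sub(light_pos, point))
    to_camera = normalize(sub(camera_pos, point))
    bounced = reflect(tuple(-c for c in to_light), normal)
    return max(0.0, dot(bounced, to_camera)) ** shininess

def highlight_per_camera():
    # Hypothetical layout, in meters. Screen is the plane z = 0.
    point = (0.0, 0.0, 0.5)   # a glossy spot on the subject
    light = (0.4, 0.8, 0.3)   # a lamp above and to the side
    cameras = {
        "center (virtual)": (0.0, 0.0, 0.0),
        "left": (-0.3, 0.0, 0.0),
        "right": (0.3, 0.0, 0.0),
        "top": (0.0, 0.2, 0.0),
        "bottom": (0.0, -0.2, 0.0),
    }
    # Worst case: pick the normal halfway between the light and the
    # center viewpoint, so the highlight is aimed at the screen center.
    to_light = normalize(sub(light, point))
    to_center = normalize(sub(cameras["center (virtual)"], point))
    normal = normalize(tuple(a + b for a, b in zip(to_light, to_center)))
    return {name: phong_specular(point, normal, light, pos)
            for name, pos in cameras.items()}

if __name__ == "__main__":
    for name, value in highlight_per_camera().items():
        print(f"{name:18s} specular = {value:.6f}")
```

Under these assumptions the virtual center viewpoint gets a full-strength highlight, while every camera on the screen edge gets essentially nothing, so their frames contain no trace of the highlight to interpolate from.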

Moving eyes means making decisions about human behavior, which is hard. Any weirdness will be very detectable. Just doing a 3D reconstruction with multiple cameras is a more established field.
> Just doing a 3D reconstruction with multiple cameras is a more established field

Yes, but that alone is not enough. You can indeed reconstruct the 3D geometry if you spend enough resources, but that won't help you find out which color the camera is going to see, because of these reflection issues. Human eyeballs are very reflective. Even if you approximate them with spheres and distort the reflections accordingly, the next subject will wear eyeglasses, whose reflecting shape is arbitrary, so you have no chance of doing that accurately enough.

The worst-case example is a person wearing eyeglasses which are completely flat on the outside. No matter how many cameras are around the screen, none of them will capture what would reflect in the eyeglasses for a missing camera at the center of the screen.

I think people will eventually solve that, not with AI postprocessing but with hardware. You can place a camera behind the center of the screen and split time between display and camera. For example, you light the display for 10ms, then for the next 6.66ms you turn off the display and read data from the camera instead. This gets you 60Hz for both display and camera.
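
The arithmetic behind that example can be checked with a short Python sketch. The function name and the `brightness_boost` figure are my own additions; the boost assumes perceived brightness tracks the time-averaged light output, which is a simplification.

```python
def multiplex_schedule(display_ms, capture_ms):
    """Timing for a panel that alternates between showing and capturing."""
    period_ms = display_ms + capture_ms
    return {
        "period_ms": period_ms,
        # One display phase and one capture phase per period,
        # so both run at the same rate.
        "rate_hz": 1000.0 / period_ms,
        # Fraction of each period during which the panel emits light.
        "display_duty": display_ms / period_ms,
        # Assumption: to look as bright as an always-on panel, the lit
        # phase has to be brighter by the inverse of the duty cycle.
        "brightness_boost": period_ms / display_ms,
    }

if __name__ == "__main__":
    schedule = multiplex_schedule(display_ms=10.0, capture_ms=6.66)
    for key, value in schedule.items():
        print(f"{key:17s} {value:.3f}")
```

With the numbers from the comment, the period is 16.66ms, which is about 60Hz, and the display is lit roughly 60% of the time. The camera gets at most 6.66ms of exposure per frame, so low-light performance would likely be the trade-off.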

Yeah, I've long thought this should be pretty doable. At least with a good TOF camera.

Most of the literature I've seen has been specifically on gaze correction, which isn't actually what you would want.

Not sure if you can still edit your comment, but you may want to put a space between those links.
For those on mobile: here are the two links, should they not become split above

https://news.ycombinator.com/item?id=24151123

https://news.ycombinator.com/item?id=24151123

They're the same link; presumably an accidental double-paste. Still useful to have it working though. :)
iOS 13 has a feature that can do something to that effect:

https://arstechnica.com/gadgets/2019/07/facetime-feature-in-...

They didn't end up shipping it.
I was wondering what happened to it! Makes sense, weird uncanny eyes might be ok in a work conferencing tool but I think FaceTime is used for too many personal and intimate calls for it to be acceptable.
Tracking speakers is best done via audio that is already linked to camera control. Face tracking by cameras in VC is something I first encountered in the late '90s (I can't recall the kit, but Sony was first on that), and it was good for presentations in which the person speaking was standing and moving.

As for perspective shifting based on multiple inputs: processing-wise, look at raytracing, as you would need to map each camera input to extrapolate the surface details and then map that out to the virtual visualisation. Basically you would need to build a 3D map, including textures, and re-render it from the required viewpoint.
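
The "re-render from the required viewpoint" step can be sketched in a few lines of Python, assuming the hard part (recovering 3D points with colors from the real cameras) is already done. This is a toy point splat with a pinhole camera model, not how any shipping product works; the function names, the 640x480 resolution, the 500-pixel focal length, and the axis conventions are all assumptions for illustration.

```python
def project(point, cam_pos, focal_px, cx, cy):
    # Pinhole projection for a camera at cam_pos looking along +z.
    # Returns (u, v, depth), or None if the point is behind the camera.
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    if z <= 0:
        return None
    return (cx + focal_px * x / z, cy + focal_px * y / z, z)

def render_points(points, cam_pos, width=640, height=480, focal_px=500.0):
    # Splat colored 3D points into a sparse image {(px, py): (depth, color)}.
    # When two points land on the same pixel, the nearer one wins, which is
    # a per-pixel depth test that handles occlusion.
    cx, cy = width / 2.0, height / 2.0
    image = {}
    for position, color in points:
        hit = project(position, cam_pos, focal_px, cx, cy)
        if hit is None:
            continue
        u, v, depth = hit
        px, py = int(round(u)), int(round(v))
        if not (0 <= px < width and 0 <= py < height):
            continue
        if (px, py) not in image or depth < image[(px, py)][0]:
            image[(px, py)] = (depth, color)
    return image

if __name__ == "__main__":
    # Two reconstructed points on the axis through the screen center (meters).
    cloud = [((0.0, 0.0, 0.5), "nose"), ((0.0, 0.0, 1.0), "wall")]
    print("virtual center camera:", render_points(cloud, (0.0, 0.0, 0.0)))
    print("real top camera:      ", render_points(cloud, (0.0, 0.1, 0.0)))
```

From the virtual center camera the nearer point hides the one behind it, while the top camera sees both at different pixels. That parallax is what the reconstruction exploits. The sketch ignores holes where no real camera saw the surface, and it reuses captured colors as-is, so it does nothing about view-dependent reflections.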

However, do you need the whole face? IMHO you really just need to fix the eyes and the eyeline contact.

But that is down to how we interact with people in meetings. Try doing a video conference in which everybody is wearing dark sunglasses: it is insightful, as you find people then focus more on what they hear.

Apple had this in a beta iOS and then removed it.
Interesting. I had heard that they do it by default in FaceTime, but I had not been able to detect it.
That doesn’t work right if you wear glasses with any significant optical distortion. In fact, the current takes on this make it significantly worse since they can’t figure out (or accurately simulate) eye position behind the lenses.
Maybe the in-display camera will solve it more easily, where the camera is integrated beneath the display and sees through a semi-transparent OLED panel.
Yes, doesn't the latest generation of smartphones already do this for the front camera?
Some have the camera behind the screen, but still at the top, far away from the center.