by tel 932 days ago
I think it's because hands have the dual properties of being extremely important to us and not primarily visual. Our hands are our primary mechanism for interacting with the world in a conscious and directed fashion. We devote a lot of mental attention to them, how to use them, how other people are using them. That's true subjectively, but you can also read it off of the Cortical Homunculus findings [1]. In short, though, we're extremely sensitive to whether hands are rendered properly and meaningfully.

And then, unlike faces, there is relatively little visual data in the world showing exactly how hands work. Unlike faces, they're not often the focal point of an image. Unlike faces, they don't present mostly forward, so in any particular image they are only partially visible. Unlike faces, hands are often defined by how they interact with other complex objects in a scene.

So both things hold: we're tough critics of hands, and image models have relatively little training data for them.

For what it's worth, as well, it's evident that image models are only good at depicting many things gesturally. At the same time, so are painters. If you're a photographer, you can often spot fake images if you notice that the exposure, focus, or lighting is implausible. If you're a mathematician, you'll notice every chalkboard full of equations is nonsense in both AI images and most Hollywood movies. If you're a botanist, I'm sure you think every AI image with a background of trees looks weird.

And then it turns out that nearly every human being is a hand-ologist to a large degree.

For another interesting experience, take a look at the Clone synthetic hand [2] which is quite obviously artificial but also, from time to time, looks surprisingly human. We're quite clearly sensitive to exactly the musculature and range of motion of our hands and know exactly what's feasible, what's painful, what feels natural and unnatural given the exact constraints of how our hand is constructed. When those limits are probed it's immediately obvious.

[1] https://en.wikipedia.org/wiki/Cortical_homunculus [2] https://www.youtube.com/watch?v=A4Gp8oQey5M&t=20s


You’re partially correct, but this isn’t an explanation for why they’re rendered wrong.

Hands are extremely complicated mechanically. They are the most complex creation evolution has come up with and part of the reason humans are able to do what they do.

Hands are like the chess game of anatomy: each segment of a hand has so many permutations that an AI simply doesn’t have enough reference info to animate it properly.

I don't think we disagree, and I do think what I argue is sufficient for image generation models to fail to render hands well. What you add, that they are very complex, is true, I believe, but I avoided using it in my argument because I'm not sure it's sufficient or necessary.

Generative models, arguably, have little trouble with complexity given enough training data. Faces are a perfect example. We both agree that image models, at least, lack that data for hands.

But there are many complex things that image models render with sparse training data which don't set off our perception as strongly. Hands fall into the uncanny valley: we are deeply familiar with them.

This is why I mention lighting and focus. They are subtle and complex. Additionally, image models have tons of training examples of each. That's still not enough for generative image models to consistently represent photography in a way that would fool a person who has spent the time to build an accurate model of how camera images look. But it fools most people.

The complexity of handling good lighting and focus involves both the generation of the entire scene that the photograph is taking place within and an accurate model of both the design of the camera and how it's been configured for the shot. Both of these are large spaces full of hidden variables that popular image models are not presently trained on.

Many people know you can look at the background of a generated image to identify irregularities. Checking that the lighting has a consistent angle (or multiple angles indicative of a cogent set of scene lights) is another good check. Additionally, if you have an eye for bokeh, then when it appears in an image you can often detect whether it's faked. Finally, even smooth blurs often do not reflect either a physically plausible background being blurred or a consistent focal plane cutting through the 3D scene. These are all additional complexities that image-generating models often haven't mastered (for now). But many judges of their outputs haven't either, so it's easy to miss these "mistakes".