

by bobbylarrybobby 251 days ago

What determines which “average” AI models latch onto? At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that. At a slightly higher level, the average of every image is the average of every subject ever photographed or drawn (human, tree, house, plate of food, ...) in concept space; but AI still doesn't generate a human with branches or a house with spaghetti on it. At a still higher level there are things we recognize as sensible scenes, e.g., barista pouring a cup of coffee, anime scene of a guy fighting a robot, watercolor of a boat on a lake, which AI still does not (by default) average into, say, an equal-parts watercolor/anime/photorealistic image of a barista fighting a robot on a boat while pouring a cup of coffee.

But it is undeniable that AI images do have an “average” feel to them. What causes this? What is the space over which AI is taking an average to produce its output? One possible answer is that a finite model size means that the model can only explore image space with a limited resolution, and as models get bigger/better they can average over a smaller and smaller portion of this space, but it is always limited.

But that raises the question of why models don't just naturally land on a point in image space. Is this just a limitation of training, which punishes big failures more strongly than it rewards perfection? Or is there something else at play here that's preventing models from landing directly on a “real” image?

4 comments

minimaxir 251 days ago

> At a pixel level, the average of every image is a grayish rectangle; that's obviously not what we mean and AI does not produce that.

That isn't correct, since pixel values in real-world images aren't uniformly distributed over [0, 255]. Take, for example, the famous ImageNet normalization magic numbers:

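    from torchvision import transforms
    # ImageNet per-channel (RGB) mean and std, with pixel values scaled to [0, 1]
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])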

If pixel values were actually uniformly distributed, then on the same [0, 1] scale the mean for each channel would be 0.5 and the standard deviation would be 0.289. Also, because of this z-normalization, the "image" most image models see is not how humans typically see images.
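
A quick sanity check of those numbers, as a minimal sketch using only the standard library. The helper names `uniform_mean_std` and `z_normalize` are made up for illustration and are not part of torchvision; the sketch also assumes a continuous uniform distribution on [0, 1], which is where the 0.289 figure comes from.

```python
import math

# ImageNet per-channel statistics quoted above (RGB, pixel values scaled to [0, 1]).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def uniform_mean_std(a, b):
    """Mean and std of a continuous uniform distribution on [a, b]."""
    return (a + b) / 2, (b - a) / math.sqrt(12)


def z_normalize(pixel, mean, std):
    """Per-channel z-normalization: (value - mean) / std (hypothetical helper)."""
    return [(p - m) / s for p, m, s in zip(pixel, mean, std)]


uniform_mean, uniform_std = uniform_mean_std(0.0, 1.0)
print(round(uniform_mean, 3), round(uniform_std, 3))  # prints: 0.5 0.289

# A mid-gray pixel is not "zero" to the model: every ImageNet channel mean is
# below 0.5, so mid-gray maps to positive values after normalization.
mid_gray = [0.5, 0.5, 0.5]
normalized_gray = z_normalize(mid_gray, IMAGENET_MEAN, IMAGENET_STD)
print([round(v, 3) for v in normalized_gray])
```

The blue channel sits furthest from 0.5, so it moves the most under normalization.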


red75prime 251 days ago

The model "averages" in the latent space, that is, in the space of packed image representations. I put "averages" in scare quotes because I think the default look might be due to legal reasons: the model training might be organized in such a way as to push its default style away from the styles of prominent artists. I might be wrong though.


azeirah 251 days ago

Isn't the space you're talking about the input images that are close to the textual prompt?

These models are trained on image+text pairs. So if you prompt something like "an apple" you get a conceptual average of all images containing apples. Depending on your dataset, it's likely going to be a photograph of an apple in the center.


kovek 250 days ago

See the third diagram in https://www.mdpi.com/1424-8220/24/18/6049 . There are elements of noise, of input embeddings in the form of images, or in the form of text.
