by probably_wrong 977 days ago
The interesting part to me is that they are getting stereotypes instead of the average.

I have never in my life seen an Indian person with a beard and turban, nor have I ever met a Mexican person wearing a sombrero and poncho. And given how boring the results of generic prompts tend to be, my theory is that they specifically tweaked their training data to avoid getting "generic Indian worker wearing a shirt" in favor of "stereotypical Indian man that would make a good NatGeo cover".

5 comments

Right, part of me would expect a generated person of <x> ethnicity to look something like those images where they superimpose a bunch of faces to find the "facial average" of different countries: https://www.artfido.com/this-is-what-the-average-person-look...

I think it's probably a matter of the training data itself using stereotypical images, though. The first page of Google Image results for "mexican man" is almost entirely guys in hats, most of those sombreros. And those images are obviously getting tagged as "mexican man" in training data, but if you have an image of (for example) the frontman of a death metal band from Mexico, I'd assume that image won't get any tags about the band members' ethnicity because it's not obvious from the image context, nor is it the most striking thing about the image itself.

Hell, you could even have two different images of the same person: one where they're wearing a poncho and sombrero, one where they're wearing ripped jeans and skull face paint. I'm sure they'd be assigned wildly different tags.

> they are getting stereotypes instead of the average.

That sort of makes sense though. The training data is labeled images, and a picture of an average Indian in, say, an Indian newspaper, or of someone posting their own picture on their blog, won't be labeled "Indian", since within that context the nationality either doesn't matter or is a given. The training data would have to include context, like "if source url tld = .in" then add "India" to the label. But that adds a whole host of other issues.

Someone correct me if I'm wrong.
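For what it's worth, the TLD idea above could be sketched roughly like this. It is only an illustration: the `TLD_TO_COUNTRY` table and the `augment_caption` helper are hypothetical names made up here, not part of any real training pipeline, and the heuristic has obvious failure modes (a .in site can host photos of anyone).

```python
from urllib.parse import urlparse

# Hypothetical mapping from country-code TLDs to a label (assumption:
# a real pipeline would need a much fuller table).
TLD_TO_COUNTRY = {"in": "India", "mx": "Mexico", "au": "Australia"}


def augment_caption(caption: str, source_url: str) -> str:
    """Append a country label inferred from the source URL's TLD, if any."""
    host = urlparse(source_url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    country = TLD_TO_COUNTRY.get(tld)
    # Leave the caption alone if the TLD is unknown or the label is already there.
    if country is None or country.lower() in caption.lower():
        return caption
    return f"{caption}, {country}"


print(augment_caption("man reading a newspaper", "https://example.in/photo.jpg"))
print(augment_caption("man in a hat", "https://example.com/a.jpg"))
```

The first call adds "India" to the caption and the second leaves it unchanged, since .com says nothing about location.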

The image model knows what images look like even without prompts, and if you train it on a trillion images it will create a latent space where similar pictures have similar embeddings. Inaccurate captions for some of them may mean that the text encoder can't get you to those embeddings, but they're still in there.

What this means is that text prompting is a bad way to drive an image generating model.
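A toy sketch of that point, under heavy assumptions: the three 3-d "embeddings" and the captions below are made up for illustration, and a plain cosine-similarity lookup stands in for a real image model. It only shows that an uncaptioned image can sit right next to a similar one in embedding space while being unreachable through text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up data: photo_c is visually close to photo_b but has no caption.
images = {
    "photo_a": {"embedding": [0.9, 0.1, 0.0], "caption": "mexican man"},
    "photo_b": {"embedding": [0.1, 0.9, 0.1], "caption": "death metal band"},
    "photo_c": {"embedding": [0.2, 0.8, 0.2], "caption": ""},
}


def nearest_by_image(query_id):
    """Return the id of the image whose embedding is closest to the query's."""
    query = images[query_id]["embedding"]
    others = [k for k in images if k != query_id]
    return max(others, key=lambda k: cosine(query, images[k]["embedding"]))


def search_by_caption(word):
    """Return ids of images whose caption contains the given word."""
    return [k for k, v in images.items() if word in v["caption"]]


print(nearest_by_image("photo_b"))
print(search_by_caption("metal"))
```

Here the image-to-image lookup finds photo_c as the neighbor of photo_b, while the caption search for "metal" only ever returns photo_b.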

> I have never in my life seen an Indian person with a beard and turban

That’s surprising. Sikhs are a prominent feature of the Indian diaspora in North America and Europe. And after 9/11 they were in the news in the US since authorities had to inform their population that these were not the Muslims who had perpetrated the attacks.

Perhaps you have seen Indians with beards and turbans but, not being informed about Sikhism, you thought that they were from somewhere else like Afghanistan or the Middle East?

> I have never in my life seen an Indian person with a beard and turban

Wait, really? I see Sikhs not too uncommonly. Sure, they are a minority of the ethnically Indian people I see where I am (Australia), but I see Sikhs with the expected beard and turban pretty often, though admittedly I live in an area that has a decent number of immigrants.

It’s both: the average depiction tends to be stereotype-based.