by genderwhy 1245 days ago
You cannot describe a procedure that collects a representative sample without introducing bias. What does representative mean? Who decides what it means? Who gets to set the parameters of over vs under sampling?

Let's say that white nationalism is a tiny fraction of ideas online. Significantly less than 0.1%. Now, you randomly sample the internet and do not collect this idea into your training set. Do you adjust your approach to make sure it's represented (because as reprehensible as it is, it is the reality of online discourse in some places?)

I genuinely believe that it's all going to be biased -- there are no unbiased news or media outlets -- and the sooner you recognize everything is biased, the sooner you can move on to building the tools to recognize and understand that bias.

Asking "why can't we strive to build an unbiased outlet" is to me like asking "why can't we build a ladder to the moon". It's an interesting question, but it should ultimately lead you to "Well, why do you want that?" Your approach is impossible, but the outcome you want might not be.


> You cannot describe a procedure that collects a representative sample without introducing bias. What does representative mean? Who decides what it means? Who gets to set the parameters of over vs under sampling?

Perhaps you can take a representative (i.e. random and statistically significant enough) sample of the population and ask them their opinion about certain (especially controversial) pieces of your training data, then adjust your training data to weigh more heavily or less heavily based on these evaluations.

That's just one idea off the top of my head, but I'm sure there are research scientists who can devise a better method than what I just came up with in 30 seconds.

> Let's say that white nationalism is a tiny fraction of ideas online. Significantly less than 0.1%. Now, you randomly sample the internet and do not collect this idea into your training set. Do you adjust your approach to make sure it's represented (because as reprehensible as it is, it is the reality of online discourse in some places?)

Sure. Otherwise you're in for a dangerous (and perhaps immoral) slippery slope. But it should be represented only as much as it is significant. Obviously you should not train your AI to weigh these ideas as much as others that are more prevalent. If it's only a tiny minority of the population that have such opinions, that should be reflected in the data (so that there is proportionally less data to account for these ideas).

One would think that a sufficiently intelligent AI would not end up being a white nationalist, though (I'm not talking about current LLM technology, but perhaps some future version of it that is capable of something akin to self-reflection or deep thought).

> I genuinely believe that it's all going to be biased -- there are no unbiased news or media outlets -- and the sooner you recognize everything is biased, the sooner you can move on to building the tools to recognize and understand that bias.

News and media outlets are biased, yes, of course. The content from these sources is not generated from the population in general.

That doesn't mean it's impossible to generate an unbiased sample of data (at least, up to a certain margin of error, depending on effort expended).

The approach you describe has the problem that it's asking majority people about the experiences of minority folks -- for instance, if you ask a statistically significant sample of the population about what it is like to be a trans man, you are going to either a) have to spend a TON of effort to interview a trans masc population, or b) going to be asking a bunch of people who have no idea what it is like.

And it gets worse. For instance, trans men have a totally different experience in rural vs coastal America vs Europe vs Africa. To get an AI who can speak confidently on what it is like to be trans male in those places will require even more interviews.

And that's before we get into set intersection territory. Take a simple example of being gay or straight, Black or white. Each of them is separately a unique experience. But being gay and white in America is very different from being gay and Black in America -- the two identities create 4 different intersections.

Now, you could say, "My AI simply will not speak about the experience of gay Black men, and the challenges/perspectives from that community", but then you've introduced a bias.

You could say, "Well, we'll go out and interview people from every set then, make sure we're covering everyone!" But where then do you stop sampling? Each additional modifier adds exponential complexity -- gay Black men from New Orleans will have a different experience from gay Black men from Lagos.
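To put a number on the intersection growth described above, here is a minimal Python sketch. The attribute lists are hypothetical and chosen only to mirror the examples in this comment; the point is just that the number of distinct groups is the product of the number of values per attribute, so it doubles with every extra two-valued modifier.

```python
from itertools import product

# Hypothetical attribute lists, chosen only to mirror the examples in the thread.
attributes = {
    "orientation": ["gay", "straight"],
    "race": ["Black", "white"],
    "city": ["New Orleans", "Lagos"],
}

def count_intersections(attrs):
    """Number of distinct groups formed by picking one value per attribute."""
    total = 1
    for values in attrs.values():
        total *= len(values)
    return total

def list_intersections(attrs):
    """Enumerate every combination of one value per attribute."""
    return list(product(*attrs.values()))

# Two attributes give the 4 intersections mentioned above; adding a third gives 8.
two = {k: attributes[k] for k in ("orientation", "race")}
print(count_intersections(two))         # 4
print(count_intersections(attributes))  # 8
```

With a fixed interview budget, each new attribute splits that budget across more groups, which is the "where do you stop sampling" problem in concrete terms.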

> The approach you describe has the problem that it's asking majority people about the experiences of minority folks

No, my approach is asking all types of people about the experience of minority folks, including those minority folks (we are all minority folks in some aspect, even if this aspect is uninteresting).

> for instance, if you ask a statistically significant sample of the population about what it is like to be a trans man, you are going to (...) be asking a bunch of people who have no idea what it is like.

Then those people can answer that they don't know what it's like to be trans.

If somebody comes up to me and asks "what is it like to be trans?", my answer would obviously be: "how the hell should I know? I'm not trans".

But trans people can answer what it's like to be trans.

> And it gets worse. For instance, trans men have a totally different experience in rural vs coastal America vs Europe vs Africa. To get an AI who can speak confidently on what it is like to be trans male in those places will require even more interviews.

Yes, you can only spend a limited amount of effort towards the goal of being unbiased. The goal is to be as unbiased as possible given that limited amount of effort.

It's still better to make X amount of effort to be unbiased than zero effort.

This is also something that can be improved over time, as better ideas and methods become available regarding how to measure and decrease bias.

Perhaps even an AI can be used to detect these biases and reduce them as best possible.

> Now, you could say, "My AI simply will not speak about the experience of gay Black men, and the challenges/perspectives from that community", but then you've introduced a bias.

Or perhaps the AI can simply answer based on the information it was trained on, making a best guess as to what that would be like, taking into account all the data that was available to it and how that data was weighed to be as unbiased as possible.

> You could say, "Well, we'll go out and interview people from every set then, make sure we're covering everyone!"

No, I think you are making a significant mistake in this reasoning. There is no "every set". There is only one set. And that is the set of all people.

> But where then do you stop sampling? Each additional modifier adds exponential complexity -- gay Black men from New Orleans will have a different experience from gay Black men from Lagos.

What modifier? There is no modifier. "SELECT RANDOM(x%) FROM TABLE all_people" (or whatever the imaginary SQL syntax would be) :)
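For what it's worth, the imaginary SQL above could be sketched in Python roughly like this. Everything here is an assumption for illustration: the population is a made-up list, and `simple_random_sample` and `miss_probability` are hypothetical helper names, not anything from a real library. The second helper computes the chance that a group with a given prevalence does not appear in the sample at all, treating draws as independent.

```python
import random

def simple_random_sample(population, fraction, seed=None):
    """Draw a uniform sample without replacement covering `fraction` of the population."""
    rng = random.Random(seed)
    k = round(len(population) * fraction)
    return rng.sample(population, k)

def miss_probability(prevalence, sample_size):
    """Chance that a group with the given prevalence is entirely absent from the
    sample, treating draws as independent (reasonable for large populations)."""
    return (1 - prevalence) ** sample_size

# Hypothetical population: 100,000 people, 0.1% of whom belong to some small group.
population = ["small_group"] * 100 + ["everyone_else"] * 99_900

sample = simple_random_sample(population, fraction=0.01, seed=0)
print(len(sample))                           # sample size: 1% of the population
print(sample.count("small_group"))           # how many small-group members were drawn
print(miss_probability(0.001, len(sample)))  # chance of drawing none of them
```

No modifier is needed to draw the sample, but how many members of a small group end up in it depends entirely on the sample size.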

> The goal is to be as unbiased as possible given that limited amount of effort.

So you are therefore biased. You have a finite set of resources, and you are choosing to allocate them in a particular way. That is bias.

You could equally choose to allocate those resources away from the majority, which would also be bias. Any time a human is making an editorial decision about how to allocate resources, you are introducing bias.

> > The goal is to be as unbiased as possible given that limited amount of effort.

> So you are therefore biased

Yes, but significantly less than before. Which is the goal.

> You have a finite set of resources, and you are choosing to allocate them in a particular way. That is bias.

That "particular way" is to give more weight to opinions that are under-represented in your training data and give less weight to opinions that are over-represented in the training data.
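A minimal sketch of that kind of reweighting, assuming you already have target shares for each group from somewhere (a census, a survey, whatever). The group labels, the numbers, and the `reweight` helper are all hypothetical; where the target shares come from is exactly the part this thread is arguing about, and the code does not settle it.

```python
from collections import Counter

def reweight(labels, target_shares):
    """Per-item weights so that each group's weighted share matches its target share."""
    counts = Counter(labels)
    n = len(labels)
    weights = []
    for label in labels:
        observed_share = counts[label] / n
        # Over-represented groups get weight < 1, under-represented groups get weight > 1.
        weights.append(target_shares[label] / observed_share)
    return weights

# Hypothetical training set: group B is 10% of the data but assumed to be 30% of the population.
labels = ["A"] * 90 + ["B"] * 10
target_shares = {"A": 0.7, "B": 0.3}  # assumed, not measured

weights = reweight(labels, target_shares)
weighted_b = sum(w for w, label in zip(weights, labels) if label == "B")
print(weighted_b / sum(weights))  # weighted share of group B
```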

This is called "removing bias".

> You could equally choose to allocate those resources away from the majority, which would also be bias. Any time a human is making an editorial decision about how to allocate resources, you are introducing bias.

So, in your view, bias can only increase, it can never decrease?

Even if that were so, you are admitting that not all data is equally biased. Which means that it is possible to feed less biased data to an AI.

And the goal is not for "a human" to make an editorial decision. It's for the opinions used for the training data to be representative of all people, weighted according to (a representative sample of) these people (so you wouldn't be giving more weight to the opinions of one person versus another).

I object specifically to this:

> Saying "there's no way to avoid the bias" is just an excuse to get away with being biased

And specifically to the implied idea that there is an unbiased target that is achievable.