> You cannot describe a procedure that collects a representative sample without introducing bias. What does representative mean? Who decides what it means? Who gets to set the parameters of over vs under sampling?

Perhaps you can take a representative (i.e. random and statistically significant enough) sample of the population and ask them their opinion about certain (especially controversial) pieces of your training data, then adjust your training data to weigh more heavily or less heavily based on these evaluations. That's just one idea that occurred to me from the top of my head, but I'm sure there are research scientists who can devise a better method than what I just came up with in 30 seconds.

> Let's say that white nationalism is a tiny fraction of ideas online. Significantly less than 0.1%. Now, you randomly sample the internet and do not collect this idea into your training set. Do you adjust your approach to make sure it's represented (because as reprehensible as it is, it is the reality of online discourse in some places?)

Sure. Otherwise you're in for a dangerous (and perhaps immoral) slippery slope. But it should be represented only as much as it is significant. Obviously you should not train your AI to weigh these ideas as much as others that are more prevalent. If it's only a tiny minority of the population that have such opinions, that should be reflected in the data (so that there is proportionally less data to account for these ideas). One would think that a sufficiently intelligent AI would not end up being a white nationalist, though (I'm not talking about current LLM technology, but perhaps some future version of it that is capable of something akin to self-reflection or deep thought).

> I genuinely believe that it's all going to be biased -- there are no unbiased news or media outlets -- and the sooner you recognize everything is biased, the sooner you can move on to building the tools to recognize and understand that bias.

News and media outlets are biased, yes, of course. The content from these sources is not generated from the population in general. That doesn't mean it's impossible to generate an unbiased sample of data (at least, up to a certain margin of error, depending on effort expended).
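
For what it's worth, the "represented only as much as it is significant" idea can be sketched as simple proportional reweighting. This is a minimal sketch, not anyone's actual pipeline: the function name `proportional_weights`, the labels, and the population shares are all made up for illustration, and it assumes you already have a trustworthy estimate of the population shares, which is exactly the contested part.

```python
from collections import Counter


def proportional_weights(sample_labels, population_share):
    """Return a per-label weight so the weighted sample matches population_share.

    sample_labels: one (hypothetical) viewpoint label per training item.
    population_share: assumed share of each viewpoint in the population.
    """
    counts = Counter(sample_labels)
    n = len(sample_labels)
    weights = {}
    for label, target in population_share.items():
        observed = counts.get(label, 0) / n
        # A viewpoint that never appears in the sample cannot be reweighted.
        weights[label] = target / observed if observed > 0 else None
    return weights


# Hypothetical sample in which a fringe view is over-represented (10% vs 0.1%).
sample = ["mainstream"] * 90 + ["fringe"] * 10
shares = {"mainstream": 0.999, "fringe": 0.001, "never_sampled": 0.0005}
weights = proportional_weights(sample, shares)
```

Here the over-represented label gets a weight well below 1 and the under-represented one a weight slightly above 1. A weight of `None` marks a viewpoint that the random sample missed entirely; no amount of reweighting recovers it, so you would have to go and collect more data.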
And it gets worse. For instance, trans men have a totally different experience in rural vs coastal America vs Europe vs Africa. Getting an AI that can speak confidently about what it is like to be a trans man in each of those places will require even more interviews.
And that's before we get into set intersection territory. Take a simple example of being gay or straight, Black or white. Each of them is separately a unique experience. But being gay and white in America is very different from being gay and Black in America -- the two identities create 4 different intersections.
Now, you could say, "My AI simply will not speak about the experience of gay Black men, and the challenges/perspectives from that community", but then you've introduced a bias.
You could say, "Well, we'll go out and interview people from every set then, make sure we're covering everyone!" But where then do you stop sampling? Each additional modifier multiplies the number of groups, so the total grows exponentially with the number of modifiers -- gay Black men from New Orleans will have a different experience from gay Black men from Lagos.
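
To put numbers on that growth, here is a minimal sketch. The attribute lists are hypothetical and deliberately tiny, and `per_stratum` (interviews per group) is an assumed parameter, not a recommendation.

```python
from itertools import product
from math import prod

# Hypothetical attributes, for illustration only; real populations have far
# more dimensions and far more values per dimension.
two_attributes = {
    "orientation": ["gay", "straight"],
    "race": ["Black", "white"],
}
three_attributes = {**two_attributes, "city": ["New Orleans", "Lagos"]}


def count_strata(attrs):
    """Number of distinct intersections: the product of the options per attribute."""
    return prod(len(values) for values in attrs.values())


def list_strata(attrs):
    """Enumerate every intersection as a tuple, one value per attribute."""
    return list(product(*attrs.values()))


def interviews_needed(attrs, per_stratum):
    """Total interviews if every intersection gets per_stratum interviews."""
    return count_strata(attrs) * per_stratum


print(count_strata(two_attributes))           # 2 * 2 = 4
print(count_strata(three_attributes))         # 2 * 2 * 2 = 8
print(interviews_needed(three_attributes, 30))  # 8 * 30 = 240
```

With k two-valued attributes you get 2^k groups, so every extra modifier at least doubles the number of groups you would need to sample.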