by genderwhy
1245 days ago
You cannot describe a procedure that collects a representative sample without introducing bias. What does representative mean? Who decides what it means? Who gets to set the parameters of over- vs. undersampling?
Let's say that white nationalism is a tiny fraction of ideas online. Significantly less than 0.1%. Now, you randomly sample the internet and do not collect this idea into your training set. Do you adjust your approach to make sure it's represented (because as reprehensible as it is, it is the reality of online discourse in some places?)
I genuinely believe that it's all going to be biased -- there are no unbiased news or media outlets -- and the sooner you recognize everything is biased, the sooner you can move on to building the tools to recognize and understand that bias.
Asking "why can't we strive to build an unbiased outlet" is to me like asking "why can't we build a ladder to the moon". It's an interesting question, but it should ultimately lead you to "Well, why do you want that? Your approach is impossible, but the outcome you want might not be."
|
Perhaps you can take a representative sample of the population (i.e. random and large enough to be statistically meaningful), ask those people their opinion about certain (especially controversial) pieces of your training data, and then weight those pieces more or less heavily based on their evaluations.
That's just one idea off the top of my head, but I'm sure there are research scientists who can devise a better method than what I came up with in 30 seconds.
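To make the idea a bit more concrete, here is a minimal Python sketch of what "adjusting the weights based on evaluations" could look like. Everything specific in it is an assumption made for illustration: the function name `reweight`, the rating scale from -1 (weight it less) to +1 (weight it more), and the multiplicative update rule. It is not a claim about how any real training pipeline does this.

```python
from statistics import mean


def reweight(base_weights, ratings, strength=0.5):
    """Adjust per-item sampling weights using survey ratings.

    base_weights: dict mapping item id -> non-negative starting weight.
    ratings: dict mapping item id -> list of respondent scores in [-1, 1]
             (hypothetical scale: -1 = weight less, +1 = weight more).
    strength: how strongly the average rating moves the weight (0 to 1).
    Returns weights normalized to sum to 1.
    """
    adjusted = {}
    for item_id, weight in base_weights.items():
        scores = ratings.get(item_id, [])
        # Items nobody rated keep their original weight (average of 0).
        avg = mean(scores) if scores else 0.0
        adjusted[item_id] = weight * (1.0 + strength * avg)

    total = sum(adjusted.values())
    if total <= 0:
        raise ValueError("all weights collapsed to zero")
    return {item_id: w / total for item_id, w in adjusted.items()}


if __name__ == "__main__":
    base = {"doc_a": 1.0, "doc_b": 1.0, "doc_c": 1.0}
    survey = {"doc_a": [1.0, 0.5], "doc_b": [-1.0, -0.5]}
    print(reweight(base, survey))
```

Note that this only moves the question of bias to who is surveyed and how the rating scale is defined, which is the point the parent comment is raising.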
> Let's say that white nationalism is a tiny fraction of ideas online. Significantly less than 0.1%. Now, you randomly sample the internet and do not collect this idea into your training set. Do you adjust your approach to make sure it's represented (because as reprehensible as it is, it is the reality of online discourse in some places?)
Sure. Otherwise you're in for a dangerous (and perhaps immoral) slippery slope. But it should be represented only in proportion to its prevalence. Obviously you should not train your AI to weigh these ideas as much as others that are more prevalent. If only a tiny minority of the population holds such opinions, that should be reflected in the data (so that there is proportionally less data to account for these ideas).
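As a rough back-of-the-envelope check on the "randomly sample and miss it" scenario, here is a small Python sketch. It assumes each sampled document is an independent draw and that the idea appears in a fixed fraction `p` of documents; both are simplifications, and the function names are made up for this example.

```python
def prob_missing(p, n):
    """Probability that none of n independent random draws contains an
    idea that appears in a fraction p of all documents."""
    return (1.0 - p) ** n


def expected_count(p, n):
    """Expected number of draws containing the idea under proportional
    (purely random) sampling."""
    return p * n


if __name__ == "__main__":
    p = 0.001  # 0.1% prevalence, the upper bound mentioned in the quote
    for n in (1_000, 10_000, 100_000):
        print(n, expected_count(p, n), prob_missing(p, n))
```

Under these assumptions a small sample can easily miss a 0.1% idea entirely, while a large one will pick it up at roughly its true share without any manual adjustment.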
One would think that a sufficiently intelligent AI would not end up being a white nationalist, though (I'm not talking about current LLM technology, but perhaps some future version of it that is capable of something akin to self-reflection or deep thought).
> I genuinely believe that it's all going to be biased -- there are no unbiased news or media outlets -- and the sooner you recognize everything is biased, the sooner you can move on to building the tools to recognize and understand that bias.
News and media outlets are biased, yes, of course. But the content from these sources is not generated by the population in general.
That doesn't mean it's impossible to generate an unbiased sample of data (at least, up to a certain margin of error, depending on effort expended).
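For what "margin of error, depending on effort expended" means in numbers, here is a short Python sketch using the standard normal-approximation formula for a sampled proportion. It assumes simple random sampling, so it only covers sampling error, not any bias in how the sample was selected; the function names and the default `z=1.96` (roughly 95% confidence) are choices made for this example.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for an observed proportion p_hat
    from a simple random sample of size n."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest n whose margin of error is at most `margin`.
    p=0.5 is the worst case (largest variance)."""
    return math.ceil(z ** 2 * p * (1.0 - p) / margin ** 2)


if __name__ == "__main__":
    n = required_sample_size(0.03)
    print(n, margin_of_error(0.5, n))
```

Because the required sample size grows with the inverse square of the margin, halving the margin of error takes roughly four times as many samples.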