Hacker News
by incomingpain 74 days ago
I don't agree with the premise that that is where academia sits. The University of Dallas or the University of Calgary is not going to show up on the far left. Not to mention religious universities like Redeemer in Hamilton.

In fact, it should balance out, especially over centuries of global content. There's absolutely no chance that the training data itself is the bias. It's the filtering and labelling of the content that introduces the bias.

The AI companies are taking left-wing content and labelling it "high quality prestige" and then looking at right-wing content and labelling it "opinion low quality" or whatever. That is where the bias is occurring.

1 comment

> In fact, it should balance out, especially over centuries of global content.

I do kinda wonder how this is divided up. I wouldn't be surprised if the median word in these things' training sets was written post-smartphone. Consider the sheer volume of video uploaded to YouTube in a day (and the corresponding volume of transcript text), and that posting on a social media site or sending an email is way lower-effort than that. The amount of material we've been able to more-or-less durably record in the last couple of decades dwarfs everything that came before.

Choices of languages to ingest would also tend to make it a bit less "global" than might be ideal.

> It's the filtering and labelling of the content that introduces the bias.

Oh, I agree, and I mentioned that some factor must be adjusting them away from the right. There's just far too much pro-authoritarian and economically right-wing writing (including Web posts, podcast or digitized radio show transcripts, etc.) for them not to lean farther that way without some form of adjustment going on, even if it's only tone-based (and sure, there's probably more than that going on).

The trouble is that these data sets per se can hardly be called unbiased with respect to almost any plausibly useful reference point one might choose. So whether they adjust or not, the result will be some kind of bias, except with respect to the training dataset itself (obviously). Like, the sheer count of positive representations of an idea in the data they've been able to get ahold of doesn't mean that the idea is as commonly positively regarded in the wild as it appears from that narrow window (see: the many observations about how very few readers of social media or forums etc. post anything). Nor, separately, does it mean that the idea is better supported by evidence or reason or what-have-you than the alternatives.