| HN Mirror

> In fact, it should balance out, especially over centuries of global content.

I do kinda wonder how this is divided up. I wouldn't be surprised if the median authorship-of-a-word age in these things' training sets is post-smartphone. Consider the sheer volume of video uploaded to YouTube in a day (and the corresponding volume of transcript text), and that posting on a social media site or sending an email is way lower-effort than that. The amount of material we've been able to more-or-less durably record in the last couple of decades dwarfs everything that came before.

Choices of languages to ingest would also tend to make it a bit less "global" than might be ideal.

> It's the filtering and labelling of the content that introduces the bias.

Oh, I agree, and I mentioned that some factor must be adjusting them away from the right. There's just far too much pro-authoritarian and economically right-wing writing (including Web posts, podcast or digitized radio show transcripts, etc.) for them not to lean farther that way without some form of adjustment going on, even if it's only tone-based (and sure, there's probably more than that going on).

The trouble is that these data sets per se can hardly be called unbiased with respect to almost any plausibly useful reference point one might choose, so whether they adjust or not, the result will be some kind of bias, except with respect to the training dataset itself (obviously). Like, the sheer count of positive representations of an idea in the data they've been able to get ahold of doesn't mean it's as commonly positively regarded in the wild as it appears from that narrow window (see: the many observations that very few readers of social media or forums ever post anything), nor (separately) that it's better supported by evidence or reason or what-have-you than the alternatives.
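
To put toy numbers on that "narrow window" point, here's a minimal sketch. Everything in it is made up for illustration: the group names, the 10%/90% split, and the posting rates are assumptions, not measurements of any real data set. It just shows how a view's share of the text can differ wildly from its share of the people once posting rates are uneven.

```python
def corpus_share(population_fraction, posts_per_person):
    """Return each group's share of total posted text.

    Both arguments are dicts keyed by group name. A group's volume is
    (fraction of people in the group) * (average posts per person).
    """
    volume = {
        group: population_fraction[group] * posts_per_person[group]
        for group in population_fraction
    }
    total = sum(volume.values())
    return {group: v / total for group, v in volume.items()}


# Hypothetical numbers, chosen only to illustrate the effect.
population_fraction = {"holds_view": 0.10, "does_not": 0.90}
posts_per_person = {"holds_view": 20.0, "does_not": 1.0}

share = corpus_share(population_fraction, posts_per_person)
for group, s in share.items():
    print(f"{group}: {population_fraction[group]:.0%} of people, {s:.0%} of text")
```

With those invented inputs, the 10% who hold the view produce roughly 69% of the text. And that's only the prevalence half of the problem; it says nothing about whether the view is any better supported.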