Hacker News
by jryio 57 days ago
If anyone was wondering ... it's racist

Unsurprisingly, the texts written up until that time were dominated by such individuals, which is tragic for LLM training if you think about it.

The voiceless groups or fringe opinions which we take as normative today do not appear.

Does this encourage us to write in the present such that we influence the models in perpetuity?

5 comments

Voiceless groups do not appear in the training data? How could they? They are voiceless. You think the voiceless people are represented in today's training data? They cannot be; they are voiceless.

Nothing tragic about using data from a time period.

Common words used in the 1900s are labeled racist now. I doubt anyone was wondering if they filtered those words for modern safe words.

I'd be more worried if words from that era were fully aligned with present day notions of morality. Wouldn't that indicate a certain stagnation & lack of progress?

Let us hope, 100 years from now, there will be people who look back unkindly on us.

As Proudhon said, "I dream of a society where I would be guillotined as a reactionary."

>The voiceless groups or fringe opinions which we take as normative today do not appear.

Times are different. Anybody with an internet connection can "publish" their thoughts and perspective online. LLMs scrape all of this. Modern datasets like CommonCrawl capture a vastly wider spectrum of humanity than a printing press ever could. The pre-1930 model acts as a time capsule of "gatekept publishing", but modern LLMs are trained on the democratized web.

>Does this encourage us to write in the present such that we influence the models in perpetuity?

I noticed a bunch of LLM-powered Reddit accounts praising products/services in dead threads. Or one bot posting a setup question, then a few other bots responding with praise / questions about a specific product. I don't know why they're doing this, but I'm beginning to suspect it's something like this: getting positive sentiment into the datasets for the next generation of LLMs.

10 years ago people might have cared about your whining, not anymore (thank god)
one day we'll have SOTA models trained like this one and there's nothing you can do about it :^)