HN Mirror


by jryio 57 days ago

If anyone was wondering ... it's racist

Unsurprisingly, the texts written up until that time were dominated by such individuals, which is tragic for LLM training if you think about it.

The voiceless groups or fringe opinions which we take as normative today do not appear.

Does this encourage us to write in the present such that we influence the models in perpetuity?

5 comments

ipaddr 57 days ago

Voiceless groups do not appear in the training data? How could they? They are voiceless. You think the voiceless people are represented in today's training data? They cannot be; they are voiceless.

Nothing tragic about using data from a time period.

Common words used in the 1900s are labeled racist now. I doubt anyone was wondering if they filtered those words for modern safe words.


SuddsMcDuff 57 days ago

I'd be more worried if words from that era were fully aligned with present day notions of morality. Wouldn't that indicate a certain stagnation & lack of progress?

Let us hope, 100 years from now, there will be people who look back unkindly on us.


NoGravitas 56 days ago

As Proudhon said, "I dream of a society where I would be guillotined as a reactionary."


idonotknowwhy 56 days ago

>The voiceless groups or fringe opinions which we take as normative today do not appear.

Times are different. Anybody with an internet connection can "publish" their thoughts and perspective online. LLMs scrape all of this. Modern datasets like CommonCrawl capture a vastly wider spectrum of humanity than a printing press ever could. The pre-1930 model acts as a time capsule of "gatekept publishing", but modern LLMs are trained on the democratized web.

>Does this encourage us to write in the present such that we influence the models in perpetuity?

I noticed a bunch of LLM-powered Reddit accounts praising products/services in dead threads. Or one bot posts a setup question, then a few other bots respond with praise for, or questions about, a specific product. I don't know why they're doing this, but I'm beginning to suspect it's something like this: getting positive sentiment into the datasets for the next generation of LLMs.


dirasieb 56 days ago

10 years ago people might have cared about your whining, not anymore (thank god)


b65e8bee43c2ed0 57 days ago

one day we'll have SOTA models trained like this one and there's nothing you can do about it :^)
