by mike_hearn 1108 days ago
It's pretty much guaranteed. Where else on the internet would this sequence of characters appear so frequently that it gets selected as one of the internet's top ~50,000 words?

Also, it's widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then pick a child, then a child of the child, etc., and then concatenate the results) and get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into single threads, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it a lot harder to generate sample conversations from them.
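To make the slicing idea concrete, here is a rough sketch of what it could look like. The `Comment` class, `slice_thread`, and `render` are hypothetical names made up for illustration; this is not Reddit's API or any particular training pipeline, just one way to walk a reply tree and turn each root-to-leaf path into a linear conversation (assumes Python 3.9+).

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Comment:
    # Hypothetical minimal comment record: author, body, and nested replies.
    author: str
    text: str
    replies: list["Comment"] = field(default_factory=list)


def slice_thread(root: Comment) -> Iterator[list[Comment]]:
    """Yield every root-to-leaf path through a comment tree."""
    if not root.replies:
        # A comment with no replies ends the conversation.
        yield [root]
        return
    for child in root.replies:
        # Each child starts its own branch; prepend the shared parent.
        for tail in slice_thread(child):
            yield [root, *tail]


def render(path: list[Comment]) -> str:
    """Concatenate one path into a single conversation string."""
    return "\n".join(f"{c.author}: {c.text}" for c in path)


# Toy thread: one root comment with two branches.
thread = Comment("alice", "Is Reddit used for training?", [
    Comment("bob", "Almost certainly.", [
        Comment("alice", "Why Reddit in particular?"),
    ]),
    Comment("carol", "Threads slice cleanly."),
])

conversations = [render(path) for path in slice_thread(thread)]
```

Each entry in `conversations` is one branch read top to bottom. This only works because every reply has exactly one explicit parent; in a flat phpBB-style thread you would first have to infer the parent of each post from its quotes.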


>There are relatively few places on the internet where that is true

Imageboards.

DailyMail.

Slashdot.

Even a somethingawful dump would have been superior.

Slashdot doesn't have the volume. I don't know about imageboards, but are they threaded, and do they cover as many topics?

The Daily Mail (the newspaper) has been used for training LLMs in the past, yes. I don't know if it still is.

imageboards sure do -- poorly.

listen, some of the niche corners of that world aren't so bad, but it ain't the place to be training AI to do something, unless that something is a hate crime

So filter the content before you use it. Clearly OpenAI did the bare minimum on this front.