| HN Mirror



	by countmora 1108 days ago
	> unfortunately the tokenizer was trained on this subreddit

	Do you have a source for that, or was it just an assumption?

3 comments

mike_hearn 1108 days ago

It's pretty much guaranteed. Where else on the internet would this sequence of characters appear so frequently that it gets selected as one of the internet's top ~50,000 words?
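To make the frequency argument concrete, here is a toy byte-pair-encoding (BPE) style trainer in Python. It is only a sketch under assumptions: the handle `quibblefrog`, the word counts, and the merge budget are all made up for illustration, and real tokenizers work on bytes with pre-tokenization rules rather than on a small word-count table. It is not the procedure any particular vendor used.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` is one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn a vocabulary: single characters plus up to `num_merges` merges."""
    words = {tuple(word): count for word, count in word_counts.items()}
    vocab = {ch for word in words for ch in word}
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        vocab.add(pair[0] + pair[1])
    return vocab


# Hypothetical corpus: one handle repeated far more often than ordinary words.
corpus_counts = {"quibblefrog": 5000, "the": 50, "frog": 20}
vocab = train_bpe(corpus_counts, num_merges=10)
print("quibblefrog" in vocab, "the" in vocab)
```

With a small merge budget, the string that dominates the corpus ends up as a single vocabulary entry while a common but rarer word does not, which is the effect described above.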

Also, it's widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e., pick a root comment, then a child, then a child of that child, and so on, then concatenate the results) and get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into single threads, with ad-hoc quoting used to disambiguate which post is replying to which. That makes it a lot harder to generate sample conversations from them.
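The thread-slicing idea can be sketched in a few lines of Python. This is a minimal illustration, assuming comments are already loaded into a tree; the `Comment` class and the `slice_thread` and `render` helpers are hypothetical names, not part of any Reddit API or known training pipeline.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Comment:
    author: str
    text: str
    children: list["Comment"] = field(default_factory=list)


def slice_thread(root, rng):
    """Walk from a root comment down to a leaf, picking one child per level."""
    path = [root]
    node = root
    while node.children:
        node = rng.choice(node.children)
        path.append(node)
    return path


def render(path):
    """Concatenate the chosen comments into one linear conversation."""
    return "\n".join(f"{c.author}: {c.text}" for c in path)


# Hypothetical example tree.
root = Comment("alice", "Is BPE deterministic?", [
    Comment("bob", "Given the same corpus and merge count, yes.", [
        Comment("alice", "Thanks, that makes sense."),
    ]),
    Comment("carol", "Depends on how ties are broken."),
])

print(render(slice_thread(root, random.Random(0))))
```

Because every comment in the tree replies to exactly one parent, any root-to-leaf path reads as a single coherent exchange. A flat forum thread gives no such guarantee without first reconstructing who replied to whom.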


dontupvoteme 1107 days ago

>There are relatively few places on the internet where that is true

Imageboards.

DailyMail.

Slashdot.

Even a somethingawful dump would have been superior.


mike_hearn 1107 days ago

Slashdot doesn't have the volume. Don't know about image boards but are they threaded and do they cover as many topics?

The Daily Mail (the newspaper) has been used for training LLMs in the past, yes. I don't know if it still is.


red-iron-pine 1107 days ago

imageboards sure do -- poorly.

listen, some of the niche corners of that world aren't so bad, but it ain't the place to be training AI to do something, unless that something is a hate crime


dontupvoteme 1107 days ago

So filter the content before you use it. Clearly OpenAI did the bare minimum on this front.


gl-prod 1108 days ago

There was a video [0] on Computerphile about this topic.

[0] https://www.youtube.com/watch?v=WO2X3oZEJOA


klooney 1106 days ago

See the old SolidGoldMagikarp drama: it's happened before.
