by countmora 1108 days ago
> unfortunately the tokenizer was trained on this subreddit

Do you have a source for that or was it just an assumption?

It's pretty much guaranteed. Where else on the internet would this sequence of characters appear so frequently that it gets selected as one of the internet's top ~50,000 words?
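To make the frequency argument concrete, here is a toy sketch of byte-pair-encoding-style vocabulary building, where the most frequent adjacent pair of symbols is merged into a new token on each round. This assumes the tokenizer in question is trained along these lines; the actual training pipeline isn't described in this thread, and the function names and the tiny corpus below are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus_counts, num_merges):
    """Return the tokens created by up to `num_merges` merge rounds."""
    # Start with every word split into single characters.
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    new_tokens = []
    for _ in range(num_merges):
        if all(len(symbols) < 2 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        new_tokens.append(pair[0] + pair[1])
    return new_tokens

# Hypothetical corpus: one string repeated far more often than anything else.
print(train_bpe({"abab": 10, "cd": 1}, num_merges=3))
```

A string that is repeated often enough in the training text ends up as its own token after a few merges, while rarer strings stay split into smaller pieces.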

Also, it's widely known that Reddit is frequently used to train LLMs. It's an unusually clean source of conversational text because you can slice threads (i.e. pick a root comment, then pick a child, then a child of the child, etc., and then concatenate the results), and you'll get a coherent conversation. There are relatively few places on the internet where that is true. For example, most phpBB forums conflate many different conversations into single threads, with ad-hoc quoting being used to disambiguate which post is replying to which. That makes it a lot harder to generate sample conversations from.
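The thread slicing described above can be sketched roughly like this. It's a minimal illustration, assuming comments are already loaded into a tree; the `Comment` class and the function names are hypothetical and not taken from any real Reddit API or training pipeline.

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comment:
    # Hypothetical in-memory representation of one comment in a thread.
    author: str
    text: str
    children: List["Comment"] = field(default_factory=list)

def slice_thread(root, rng):
    """Pick a root, then a child, then a child of the child, down to a leaf."""
    path = [root]
    node = root
    while node.children:
        node = rng.choice(node.children)
        path.append(node)
    return path

def all_slices(root):
    """Yield every root-to-leaf path instead of sampling a single one."""
    if not root.children:
        yield [root]
        return
    for child in root.children:
        for tail in all_slices(child):
            yield [root] + tail

def to_conversation(path):
    """Concatenate one slice into a single conversation string."""
    return "\n".join(f"{c.author}: {c.text}" for c in path)
```

Each slice reads as a linear back-and-forth because every comment in it replies directly to the one before it. A flat forum thread has no such parent links, which is why the reply structure there has to be guessed from quoting.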

> There are relatively few places on the internet where that is true

Imageboards.

DailyMail.

Slashdot.

Even a somethingawful dump would have been superior.

Slashdot doesn't have the volume. I don't know about imageboards, but are they threaded, and do they cover as many topics?

The Daily Mail (the newspaper) has been used for training LLMs in the past, yes. I don't know if it still is.

imageboards sure do -- poorly.

listen, some of the niche corners of that world aren't so bad, but it ain't the place to be training AI to do something, unless that something is a hate crime

So filter the content before you use it. Clearly OpenAI did the bare minimum on this front.

There was a video[0] on Computerphile about this topic.

[0] https://www.youtube.com/watch?v=WO2X3oZEJOA

See the old SolidGoldMagikarp drama; it's happened before.