| HN Mirror

	by generalizations 1161 days ago
	The pile is an open dataset, and so is libgen. Should be pretty easy to confirm.

2 comments

kevingadd 1161 days ago

There's some nuance to this, I think. For one arbitrary example where this might not hold: NovelAI was trained on data from 'danbooru', an imageboard where people repost and tag art. All the tagging on that site is in English and they frequently also translate things like the author's description of the image and any in-image text. So if you were to use that site as a dataset, it would all be English.

Or is it? The original source content was in a mix of languages: English, Japanese, Chinese, Korean, etc. Only afterwards was it translated into English and tagged in English. So if you had trained on the original source content, you would have been training on a mix of languages, but that got erased for the convenience of the people training the network.

unaindz 1161 days ago

Even if the original images have a mix of languages, I think the tagging is all done in English (I may be wrong). I would argue that the source material includes the tagging, since the AI needs it to be trained, so the content is not really mixed but entirely English.

But anyway, the danbooru tags consist of things like: short hair, blue eyes, portrait. These are much easier to translate (or "understand") across several languages than the entire phrases that a model like GPT handles.

kevingadd 1160 days ago

Yeah, the danbooru tagging is done in English. However, if the art is sourced from places like Pixiv, those sites do tagging in the site's native language. My point is that the original content was in a mix of languages, but the process of tagging and training normalized it all into English. The result is a situation where even the people who authored the original art will now pay more to use the resulting networks, if billed per-token, unless they learn English. So we're basically taking all this input from various cultures, Englishifying it, and then potentially billing them more if they want to keep using their native tongue. Kind of sad.

wongarsu 1161 days ago

Libgen is 57% English (17% Russian, 8% German) [1]. By comparison, 10% of Wikipedia is in English [2]. (These go by number of files and number of articles respectively, both flawed metrics.)

Though I feel that's answering a slightly different question. Data used to train currently popular models is mostly English, and the majority of data in sources popular in the anglosphere is English. Neither of these shows whether the majority of available media is English.

[1] https://www.reddit.com/r/libgen/comments/r3lzg2/top_15_langu...

[2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Co...
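
For anyone who wants to check count-based shares like the ones quoted above against an open dataset, here is a minimal sketch. It assumes you already have one language label per item (per file, per article, etc.); the `language_shares` helper and the sample labels are hypothetical and are not real Libgen or Wikipedia data. It inherits the same flaw mentioned above: counting items ignores how long each item is.

```python
from collections import Counter


def language_shares(labels):
    """Return each language's share of items as a percentage of the total count.

    `labels` is an iterable with one language code per item. This is a
    hypothetical helper for illustration, not part of any dataset's tooling.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders languages from largest to smallest count.
    return {lang: 100.0 * n / total for lang, n in counts.most_common()}


# Made-up toy sample: NOT real Libgen or Wikipedia numbers.
sample = ["en", "en", "ru", "de", "en", "ru", "en", "zh"]
shares = language_shares(sample)

for lang, pct in shares.items():
    print(f"{lang}: {pct:.1f}%")
```

Weighting by bytes or words instead of item count would only need the counter to be incremented by each item's size rather than by one.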
