by kevingadd
1161 days ago
There's some nuance to this, I think. For one arbitrary example where this might not hold: NovelAI was trained on data from 'danbooru', an imageboard where people repost and tag art. All the tagging on that site is in English, and users frequently also translate things like the author's description of the image and any in-image text. So if you were to use that site as a dataset, it would all be English. Or would it? The original source content was in a mix of languages - English, Japanese, Chinese, Korean, etc. It only got translated into English and tagged in English afterwards. So if you had trained on the original source content, you would have been training on a mix of languages, but that got erased for the convenience of the people training the network.

But anyway, the danbooru tags consist of things like: short hair, blue eyes, portrait. These are much easier to translate (or "understand") across several languages than the entire phrases a model like GPT deals with.