Hacker News
by FredPret 1168 days ago
What an interesting aspect I haven't considered before. All the AIs will be trained on the available media - most of which is English.

I sometimes wonder what it takes to unseat a lingua franca, but it looks like we won't see that soon. English is set to dominate for a long time.

7 comments

Doesn't really matter. There's lots of positive transfer in individual language learning. Competence in one language bleeds into competence in others. https://arxiv.org/abs/2108.13349

GPT-3 is fluent in many languages despite English taking up 93% of the corpus by word count. French is next with 1.8%.

https://github.com/openai/gpt-3/blob/master/dataset_statisti...

Dunno the statistics of language presence with GPT-4 but it takes it up another level in terms of its multilingual capabilities.

I posted on another thread that GPT-4 handles Norwegian just fine (0.1% of the training data for GPT-3). Not only that: Norway has two official languages that are mutually intelligible and close enough that some would consider them dialects, and it can handle Nynorsk, the smaller of the two (Bokmål being the other), just fine as well.

Going one step further, I asked it to "translate" into both "Riksmål", an artificial conservative variant of Bokmål that basically rejects most of the last few decades' worth of language reforms, and Romeriksdialect (a dialect from the Eastern part of Norway)... For the latter it gave me a lecture about how it varies internally in the region (which is correct) and presented a "translation" of a test sentence that is recognisably one of the variants from the Northern part of the region.

Of course, for these, competence definitely bleeds over. They share an almost identical grammar and most of their orthography, but I'm impressed enough that it can handle Norwegian that well at all, much less that it knows the distinctions between the variants.

Yeah, its language skills are through the roof. There's no reason to talk to it in English. From what I can tell, it does a decent job of translating out of even languages like Southern Sami, with ~300 speakers and an utterly negligible training corpus. It seems it knows enough about grammar from related languages, and can infer enough from context (and maybe even etymology), that it does an OK job.

I tested it by giving it some news articles from NRK Sápmi and comparing its output with the Norwegian translations they have.

Edit: Seems I may have gotten lucky that time, it's being a lot more, um, creative in its translation now. Or for all I know it could be changes in the model.

Looking at the basic ChatGPT (not GPT-4): while it can do reasonable translations for smaller languages and answer questions in them, the quality of the answers suffers significantly in my experience. If I ask the same factual question in two languages, I often see that the English one gets a correct answer while the small language gets a coherent hallucination. For big languages (French, Japanese, Spanish, etc.) that's not an issue, but for the smaller ones it clearly is.
> There's no reason to talk to it in English.

Depends what you're doing. I haven't managed to make it continue in Japanese after it stopped in the middle of a sentence, but giving it the instruction to continue in English works. In some other cases, prompting in English (and asking for an answer in Japanese) can produce better results than giving the same prompt in Japanese.

Replying "続けて" or "continue" works.

Generating Japanese is slower than English (it's annoying on GPT-4), and that's my reason to prefer English sometimes (especially for tech topics). ChatGPT web users don't pay for each token, but API users do, so they might make a different decision.
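To make the per-token point a bit more concrete, here is a rough sketch. It does not count tokens; it only compares UTF-8 bytes per character, on the assumption that a byte-level BPE tokenizer (the kind GPT-2 and GPT-3 use) starts from UTF-8 bytes, so scripts that need more bytes per character tend to need more tokens unless the tokenizer has learned merges for them. The helper name `utf8_bytes_per_char` is made up for illustration, and real token counts have to come from the model's actual tokenizer.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character.

    This is only a crude proxy for tokenizer cost, not a token count.
    """
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# The two "keep going" prompts from the comments above.
samples = {"English": "continue", "Japanese": "続けて"}

for label, sample in samples.items():
    n_chars = len(sample)
    n_bytes = len(sample.encode("utf-8"))
    print(f"{label}: {n_chars} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(sample):.1f} bytes/char")
```

Japanese packs more meaning into each character, so bytes per character overstates the gap per sentence. Treat it as a hint about direction, not a price estimate.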

In my experience, while "continue" can work, "続けて" doesn't. At least not when making it rewrite large texts, which is when I hit the limit. With "continue", it continues rewriting. With "続けて", it tends to make up new text, that yes, is the continuation of what it was writing, but with no connection to the original text it was in the middle of rewriting.
ChatGPT speaks a ton of languages, and very well at that. Hell, it is better at my native language than I am, and I am from a pretty small country.
This may be backwards. When AI can cheaply, quickly and with nuance intact translate between languages, it becomes easier to use a preferred non-dominant language, which would make English less dominant. There's less incentive to spend so much time learning this oddly irregular foreign tongue if the skill is embedded in your phone.
> All the AIs will be trained on the available media - most of which is English.

Is it?

The pile is an open dataset, and so is libgen. Should be pretty easy to confirm.
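If someone wants to try confirming it, a very rough first pass could look like the sketch below. It assumes you already have a text sample from the dataset loaded as a string, and it tallies letters by Unicode script using the first word of each character's Unicode name. Script is not language (English, French and German all count as LATIN here), so this only separates broad groups like Latin, Cyrillic and CJK. Telling English from French would need a proper language-identification model. The function name `script_shares` is made up for illustration.

```python
import unicodedata
from collections import Counter
from typing import Dict


def script_shares(text: str) -> Dict[str, float]:
    """Share of alphabetic characters per Unicode script label.

    The label is the first word of the character's Unicode name,
    e.g. LATIN, CYRILLIC, HIRAGANA, CJK. This is a heuristic only.
    """
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        try:
            label = unicodedata.name(ch).split()[0]
        except ValueError:
            label = "UNKNOWN"  # character has no name in this Unicode DB
        counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


sample = "short hair, blue eyes. короткие волосы. 短い髪"
for label, share in sorted(script_shares(sample).items()):
    print(f"{label}: {share:.0%}")
```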
There's some nuance to this, I think. For one arbitrary example where this might not hold: NovelAI was trained on data from 'danbooru', an imageboard where people repost and tag art. All the tagging on that site is in English and they frequently also translate things like the author's description of the image and any in-image text. So if you were to use that site as a dataset, it would all be English.

Or is it? The original source content was in a mix of languages: English, Japanese, Chinese, Korean, etc. It only then got translated into English and tagged in English. So if you had trained on the original source content, you would have been training on a mix of languages, but that got erased for the convenience of the people training the network.

Even if the original images have a mix of languages, I think the tagging is all done in English (I may be wrong). I would argue that the source material includes the tagging, as it is necessary for the AI to get trained, so the content is not really mixed but entirely English.

But anyway, the danbooru tags consist of things like: short hair, blue eyes, portrait. Those are much easier to translate (or "understand") across several languages than the entire phrases GPT deals with.

Yeah, the danbooru tagging is done in English. However, if the art is sourced from places like Pixiv, those sites do tagging in the site's native language. My point is that the original content was in a mix of languages, but the process of tagging and training normalized it all into English, resulting in a situation where even the people who authored the original art will now pay more to use the resulting networks if billed per-token, unless they learn English. So we're basically taking all this input from various cultures, Englishifying it, and then potentially billing them more if they want to keep using their native tongue. Kind of sad.
Libgen is 57% English (17% Russian, 8% German) [1]. By comparison, 10% of Wikipedia is in English [2]. (That's going by number of files and number of articles respectively, both flawed metrics.)

Though I feel that's answering a slightly different question. Data used to train currently popular models is mostly English, and the majority of data in sources popular in the anglosphere is English. Neither of these shows whether the majority of available media is English.

[1] https://www.reddit.com/r/libgen/comments/r3lzg2/top_15_langu...

[2] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#Co...

FWIW, Chinese tech companies also have a lot of really impressive stuff, like WuDao 2.0. They just don't get the same amount of press.
An old version of Google’s MoE without any benchmarks is impressive?
I'm wondering about this in the context of new programming languages. If people are using LLMs to learn a new language, will a new programming language be at a disadvantage until there's a critical mass of code, comparisons to existing languages, Rosetta Stone style examples, etc?
> All the AIs will be trained on the available media - most of which is English.

Are you sure about that? Most of the media we see, sure, but there has been, and still is, a lot of media being produced in other languages.