|
|
|
|
|
by yosito
1442 days ago
|
|
I'm curious how much work it takes to prepare training data for a language. From anecdotal experience, I've always been able to learn some basic survival skills in a new language by studying the translations of about 20 key phrases for a week or so, which give me the ability to combine them into a few hundred different phrases and survive most daily transactions. So I always imagine that training a language model is similar, just on a much larger scale. It seemed to me that there could be a standard text that includes a lot of important topics and contexts, which just needs to be manually translated into a target language and then fed to the model. I imagine it being about the size of a large book, so I imagine that adding a new language to a model would cost a similar amount to paying to have a book translated. Obviously the size of the input text would have an effect on how good the model's translations are, and domain specific translations would require more specific input. While having a full translation of an entire library seems like a good way to train a model that's used to translate everything, it seems like a small percentage of the library would be enough to produce native-level translations for most domains. How far off are my intuitions on this? What are the costs of adding a new language to a model like this? Is there a ballpark dollar amount per language? |
|
Importantly, models can learn from other languages are similar. If we train separate models for each direction on small amounts of data, the performance is significantly worse than grouping languages in one large multilingual model.