by jw4ng 1442 days ago
Without any supervised training data, it's pretty difficult to create a very good translation model. For many languages, data might only be available in religious domains such as the Bible, or not available at all. We created a dataset called NLLB-Seed for this reason: it's approx. 6K sentences, available for 39 languages, covering a broad set of topics translated from Wikipedia. We found that with a dataset like NLLB-Seed, we have sufficient supervised signal to jumpstart our automatic dataset creation pipeline. Of course, the more high-quality aligned data, the better the model performance, but our project explores how we can make models more efficient at learning even when the training data is small.

Importantly, models can learn from other languages that are similar. If we train separate models for each direction on small amounts of data, the performance is significantly worse than when we group languages in one large multilingual model.