by packet_nerd
1001 days ago
What would a good fine-tuning dataset for language translation look like? I want to try fine-tuning a model to machine translate to and from a fairly niche language (https://en.wikipedia.org/wiki/S'gaw_Karen_language). How much text would I need, and what format would be ideal? I have a number of book-length texts, most only in the target language, and a few that are bilingual or multilingual. From the bilingual and multilingual texts, I can probably script out several thousand pairs of "translate the following text from <source_lang> to <target_lang>: <source_lang_text> <target_lang_text>". Do I need to vary the prompt and format, or can I expect the LLM to generalize to different translation requests? Is there value in repeating the material at different lengths, for example one set at sentence length, another at paragraph length, and another at page or chapter length? Also, what should be done with the monolingual texts? Should I just ignore them?
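For concreteness, the pair scripting described above could look roughly like the sketch below. It is only an illustration: the template wordings, the `make_examples` and `to_jsonl` helper names, and the prompt/completion JSONL layout are all assumptions, not a format any particular fine-tuning tool requires. Each aligned pair is emitted in both directions, with a randomly chosen prompt template so the wording varies.

```python
import json
import random

# Hypothetical prompt templates; the wording is an assumption, not a tested recipe.
TEMPLATES = [
    "Translate the following text from {src} to {tgt}: {text}",
    "{src} to {tgt}:\n{text}",
    "How would you say this in {tgt}?\n{text}",
]


def make_examples(pairs, lang_a, lang_b, seed=0):
    """Turn aligned (text_a, text_b) pairs into prompt/completion records, both directions."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    examples = []
    for text_a, text_b in pairs:
        directions = (
            (lang_a, lang_b, text_a, text_b),
            (lang_b, lang_a, text_b, text_a),
        )
        for src, tgt, source_text, target_text in directions:
            template = rng.choice(TEMPLATES)
            examples.append({
                "prompt": template.format(src=src, tgt=tgt, text=source_text),
                "completion": target_text,
            })
    return examples


def to_jsonl(examples):
    """Serialize one JSON object per line, keeping non-ASCII script readable."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)
```

The same function can be run separately over sentence-, paragraph-, and page-aligned pairs if you want the different lengths in one training file.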
As noted below, extracting words or key terms would probably be a good idea, since they could be included in the training set.
The training set would then consist of the prompt, the translation, and the key terms. Since you will want to vet the generated texts anyway, you could use that review to decide whether the foundation model is already working well enough. You could also run the largest "open" model you can find on the same prompts, to see whether it needs fine-tuning at all. There are many Llama-based models on Hugging Face trained for specific language pairs, so check whether your languages are already covered and test those first.
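A minimal sketch of what such a record could look like, assuming you already have a small bilingual glossary as a dict. The `find_keyterms` and `build_record` names, the "Key terms:" prompt suffix, and the record fields are all hypothetical, chosen only to illustrate the prompt + translation + key terms idea.

```python
def find_keyterms(source_text, glossary):
    """Return glossary entries whose source term occurs in source_text.

    Uses a case-insensitive substring match, which is crude: "river"
    would also match inside "driver". A real pipeline would tokenize first.
    """
    lowered = source_text.lower()
    return {
        term: translation
        for term, translation in glossary.items()
        if term.lower() in lowered
    }


def build_record(source_text, target_text, glossary,
                 src="English", tgt="S'gaw Karen"):
    """Build one training record: prompt, reference translation, key terms."""
    keyterms = find_keyterms(source_text, glossary)
    prompt = f"Translate the following text from {src} to {tgt}: {source_text}"
    if keyterms:
        # Append matched glossary entries as hints (assumed format).
        hints = "; ".join(f"{k} = {v}" for k, v in keyterms.items())
        prompt += f"\nKey terms: {hints}"
    return {"prompt": prompt, "completion": target_text, "keyterms": keyterms}
```

Keeping the key terms as a separate field as well as in the prompt makes it easy to check, while vetting outputs, whether the model actually used the expected term.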
I'm building a simple, Open Source ML pipeline manager at https://ai.featurebase.com/. I'd be down to help you with this!