Ask HN: I want to train a LM on my home country's dialect, how can I do it?
24 points
by the_generalist
1226 days ago
I'm from Algeria. The language almost everybody speaks day to day is a weird mix of different languages: French, Arabic, English, etc. I was thinking of grabbing data from tweets to fine-tune a model. I may be able to figure out other sources, but they're not going to be much better than that: just short-form text for the most part. I was thinking of starting from one of the smaller models I came across recently (nanoGPT, for example) or something similar. I'm tech-savvy enough to make this work, but I'd like some feedback from people more knowledgeable than me before I put time and effort into this. Thanks!
Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links are in here: https://ai.googleblog.com/2023/01/google-research-2022-beyon...
But when that kind of work has been successful, it has been the effort of many different researchers. And it usually starts with data.
Training a language model on top of that data is definitely doable even for an individual. You just might not be able to train on a huge dataset, or you might hit a wall in how low a perplexity you can reasonably reach.
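
To make "perplexity" concrete before spending GPU time, here's a rough sketch of a character-level bigram baseline with add-one smoothing, scored on held-out text. This is not how nanoGPT or any particular model does it; the function names and the toy strings are made up for illustration, and you'd swap in your own scraped text. The idea is just that a fine-tuned model should land well below whatever number this gives on the same held-out split.

```python
import math
from collections import Counter


def train_bigram(text):
    """Count character bigrams, their left contexts, and the characters seen."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])  # how often each char has a successor
    vocab = set(text)
    return pair_counts, context_counts, vocab


def perplexity(model, text):
    """Per-character perplexity of `text` under add-one smoothing."""
    if len(text) < 2:
        raise ValueError("need at least two characters to score")
    pair_counts, context_counts, vocab = model
    # One extra slot so characters never seen in training get nonzero mass.
    v = len(vocab) + 1
    log_prob = 0.0
    n = 0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)


# Toy placeholder strings (assumption: replace with your own train/held-out split).
model = train_bigram("abababab")
familiar = perplexity(model, "abab")    # follows the training pattern
unfamiliar = perplexity(model, "bbaa")  # mostly unseen transitions
print(familiar, unfamiliar)
```

Lower is better: text that looks like the training data scores closer to 1, and text full of unseen transitions scores higher. Keep the held-out split fixed so numbers stay comparable across runs.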