Ask HN: I want to train an LM on my home country's dialect, how can I do it?

24 points by the_generalist 1273 days ago

I'm from Algeria. The language spoken on a daily basis by almost everybody is a weird mix of different languages: French, Arabic, English, etc.

I was thinking of grabbing data from tweets to fine-tune the model. I may be able to figure out other sources, but it's not gonna be much better than that. Just short-form text for the most part.

I was thinking of potentially leveraging the smaller models I came across recently (nanoGPT for example) or something similar.

I'm tech-savvy enough to make this work but I'd like some feedback from people more knowledgeable than me before I put time and effort into this.

Thanks!

5 comments

ktrnka 1273 days ago

I'd suggest starting with just building a high quality data set with text from a variety of domains, and starting off by publishing that. Maybe even developing some related tech like adding the dialect to language id packages. Another key thing might be to build a nicely curated word list for the dialect, and make sure there's good documentation for researchers wanting to work in the language.

Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links in here https://ai.googleblog.com/2023/01/google-research-2022-beyon...

But also when it's been successful, it's an effort of many different researchers. And it usually starts with data.

Training a language model on top of it is definitely doable even for individuals, you just might not be able to train on a huge data set or you might hit a wall in terms of the perplexity you can reasonably reach.
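
For anyone unfamiliar with perplexity, here is a minimal sketch of how it is computed, using a character-level bigram model with add-alpha smoothing as a stand-in. This is not what nanoGPT or any neural LM does internally; the function names and the toy strings are made up for illustration, and a real evaluation would use held-out dialect text.

```python
import math
from collections import Counter

def train_char_bigram(text):
    """Count character bigrams and the contexts they start from."""
    pair_counts = Counter(zip(text, text[1:]))
    context_counts = Counter(text[:-1])
    vocab = set(text)
    return pair_counts, context_counts, vocab

def perplexity(model, text, alpha=1.0):
    """Perplexity = exp of the average negative log-probability per predicted character."""
    if len(text) < 2:
        raise ValueError("need at least two characters to score a bigram model")
    pair_counts, context_counts, vocab = model
    # One extra slot shared by all unseen characters (an approximation).
    v = len(vocab) + 1
    log_prob = 0.0
    n = 0
    for prev, cur in zip(text, text[1:]):
        p = (pair_counts[(prev, cur)] + alpha) / (context_counts[prev] + alpha * v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy strings only (assumption: placeholders, not real dialect data).
model = train_char_bigram("abababab")
familiar = perplexity(model, "abab")    # follows the training pattern
unfamiliar = perplexity(model, "aabb")  # contains bigrams never seen in training
```

Lower is better: text that looks like the training data gets a lower perplexity than text that does not, which is why it works as a rough yardstick for how far a small model has gotten.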

ktrnka 1273 days ago

Also I'm happy to help any way I can. I'm not sure of the best practices for sharing contact info on HN, but if you Google K Trnka language modeling I should be the only one.

the_generalist 1273 days ago

That's great, I think I've added you on LinkedIn

LunarAurora 1273 days ago

Sadly, the very best datasets that seem publicly available are for Gulf Arabic dialect (where the money is) [1]

I suggest you contact https://www.icompass.tn/, a (Tunisian) startup specializing in Natural Language Processing that processes Arabic dialects and African languages.

On a general note, I believe this kind of work should be (urgently) nationally funded, because these countries will be forced to use second languages like French or literary Arabic when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [2]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names on adapting the best transcription models (like Whisper) to them first is the key IMO.

[1] https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/...

[2] https://news.ycombinator.com/item?id=34492572

the_generalist 1273 days ago

Hey, thanks for the reply, those are some very good points you raised. I'll explore the resources you shared as well.

The trick with this kind of project is the outcome. The way I was thinking about it was mostly as a personal side project. But if it requires more resources and effort than that then it's a different topic.

It's not clear who'd benefit from this, beyond an interesting curiosity to toy with here and there.

LunarAurora 1273 days ago

> mostly as a personal side project.

Yeah, why not? Then you could make it open source for the next person to build on, like this: https://www.researchgate.net/post/Any_available_algerian-dia...

> It's not clear who'd benefit from this.

A lot of people. Don't you think the Algerian government is monitoring social networks? How are they processing it? This is the most evident "security" need (for which states are generally very generous with their pockets).

In the longer term, as I said earlier, this is a key to everything from humanities research to daily computer use.

the_generalist 1273 days ago

Yeah, I agree with you in spirit. It's just that the government might not see it the same way (mostly for lack of mastery).

But what I was trying to say is: as you mentioned earlier, this is a purely spoken language. Any "formal" communication happens either in French, Arabic or, more rarely, in English. As things stand right now, a dialect LM wouldn't get a lot of mileage.

Which is why I wanted to kinda limit the scope initially.

yorwba 1273 days ago

If all you want is an LM and it doesn't need to be trained by you or run on infrastructure you control, you could check whether ChatGPT already understands the dialect well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny.

If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and use it to filter https://commoncrawl.org/
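
To make the classifier-then-filter idea concrete, here is a rough sketch: a character-trigram naive Bayes classifier with uniform class priors, plus a function that keeps only lines predicted to be in the target language. Everything here is an assumption for illustration: the class and function names are made up, English and French stand in for "dialect" vs. "everything else", and the six training sentences are toy data. A real filter would need thousands of labeled examples and would read Common Crawl's extracted text files rather than a Python list.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Lowercase, pad with spaces, and return overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NgramLanguageClassifier:
    """Naive Bayes over character n-grams (hypothetical helper, uniform priors)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.counts = {}   # label -> Counter of n-grams
        self.totals = {}   # label -> total number of n-grams seen
        self.vocab = set()

    def fit(self, samples):
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def score(self, text, label):
        """Smoothed log-likelihood of the text under one label."""
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        denom = self.totals[label] + self.alpha * v
        return sum(
            math.log((self.counts[label][g] + self.alpha) / denom)
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        return max(self.counts, key=lambda label: self.score(text, label))

def filter_lines(lines, clf, target):
    """Keep only the lines the classifier assigns to the target label."""
    return [line for line in lines if clf.predict(line) == target]

# Toy training data (assumption: stand-in languages, far too small for real use).
toy_samples = [
    ("the cat is on the table", "en"),
    ("where are you going today", "en"),
    ("this is a good book", "en"),
    ("le chat est sur la table", "fr"),
    ("je vais au travail demain", "fr"),
    ("c'est un bon livre", "fr"),
]
clf = NgramLanguageClassifier().fit(toy_samples)
crawl_lines = ["the book is on the table", "le livre est sur la table"]
kept = filter_lines(crawl_lines, clf, target="fr")
```

Character n-grams are a common choice for this because they cope with inconsistent spelling, which matters for a dialect that is mostly spoken and mixes several languages and scripts.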

the_generalist 1273 days ago

That's really interesting. I really wanted to keep things simple: get the data (scrape Twitter, for example), use HuggingFace's AutoML or something similar. I'm not sure if this is even possible but this was my initial "pipeline".

barrenko 1273 days ago

Do you know of further links or resources in this "direction"? This is awesome.

enoreyes 1273 days ago

https://huggingface.co/alger-ia/dziribert

There is this model which also has a paper describing their methods for a BERT-family model designed for the Algerian dialect.

tooltitude 1273 days ago

You could do data augmentation. You could automatically translate (there're open source models to do so) to your language from close enough languages, and train your model on this data.
