by LunarAurora 1228 days ago
Sadly, the very best datasets that seem to be publicly available are for the Gulf Arabic dialect (where the money is) [1].

I suggest you contact https://www.icompass.tn/, a (Tunisian) startup specialized in Natural Language Processing...that processes Arabic dialects and African languages.

On a general note, I believe this kind of work should be (urgently) nationally funded, because these countries will be forced to use second languages like French, or literary Arabic, when AI/NLP becomes the dominant computing paradigm (bots, prompts...). A model in this respect is what Sweden is doing [2]. For mostly "oral" dialects (like Algerian, I guess), collaborating with big names to adapt the best transcription models (like Whisper) to them first is the key IMO.

[1] https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/...

[2] https://news.ycombinator.com/item?id=34492572


Hey, thanks for the reply, those are some very good points you raised. I'll explore the resources you shared as well.

The trick with this kind of project is the outcome. The way I was thinking about it was mostly as a personal side project. But if it requires more resources and effort than that then it's a different topic.

It's not clear who'd benefit from this, beyond an interesting curiosity to toy with here and there.

> mostly as a personal side project.

Yeah, why not? Then you could make it open source for the next person to build on, like this: https://www.researchgate.net/post/Any_available_algerian-dia...

> It's not clear who'd benefit from this.

A lot of people. Don't you think the Algerian government is monitoring social networks? How are they processing them? This is the most evident "security" need (for which states are generally very generous with their pockets).

In the longer term, as I said earlier, this is a key to everything from humanities research to daily computer use.

Yeah, I agree with you in spirit. It's just that the government might not see it the same way (mostly for lack of mastery).

But what I was trying to say is: as you mentioned earlier, this is a purely spoken language. Any "formal" communication happens either in French, Arabic or, more rarely, in English. As things stand right now, a dialect LM wouldn't get a lot of mileage.

Which is why I wanted to kinda limit the scope initially.