|
|
|
|
|
by yorwba
1231 days ago
|
|
If all you want is a LM and it doesn't need to be trained by you or run on infrastructure you control, you could try to see whether ChatGPT already understands well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny. If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and use it to filter https://commoncrawl.org/ |
|