by yorwba 1231 days ago
If all you want is an LM and it doesn't need to be trained by you or run on infrastructure you control, you could check whether ChatGPT already understands the language well enough. A Tunisian friend of mine told me that he asked it to tell a joke in Tunisian Arabic and it worked, only the joke wasn't funny.

If you want or need to train on your own data, social media is a good bet for colloquial language. You could try exporting your own data to get something to play with without having to write a crawler. Or try building a language classifier first and use it to filter https://commoncrawl.org/
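
To make the classifier-then-filter idea concrete, here is a minimal sketch in plain Python: a character-trigram naive Bayes model with add-one smoothing, plus a helper that keeps only the lines predicted to be in the target language. Everything in it is an assumption for illustration: the names `NgramLanguageClassifier`, `filter_by_language` and `TOY_SAMPLES` are made up, and the English/French sentences are placeholders standing in for labelled samples of your own language. For real use you would train on far more text and feed it lines extracted from Common Crawl dumps, or reach for an existing language-ID tool instead.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of text, lowercased and space-padded."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageClassifier:
    """Hypothetical toy classifier: naive Bayes over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-grams seen for that label
        self.totals = {}    # label -> total number of n-grams for that label
        self.vocab = set()  # every n-gram seen during training

    def fit(self, samples):
        """Train on an iterable of (text, label) pairs."""
        for text, label in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def score(self, text, label):
        """Log-likelihood of text under label, with add-one smoothing."""
        vocab_size = len(self.vocab) + 1  # +1 leaves room for unseen n-grams
        counts, total = self.counts[label], self.totals[label]
        return sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text, self.n)
        )

    def predict(self, text):
        """Return the label with the highest log-likelihood."""
        return max(self.counts, key=lambda lab: self.score(text, lab))


def filter_by_language(lines, classifier, target):
    """Keep only the lines the classifier assigns to the target label."""
    return [line for line in lines if classifier.predict(line) == target]


# Placeholder training data (assumption): swap in labelled text of your own.
TOY_SAMPLES = [
    ("the cat is sitting on the mat", "en"),
    ("where are you going today", "en"),
    ("this is a good book", "en"),
    ("le chat est assis sur le tapis", "fr"),
    ("où est-ce que tu vas aujourd'hui", "fr"),
    ("c'est un bon livre", "fr"),
]

clf = NgramLanguageClassifier().fit(TOY_SAMPLES)
crawled = ["the book is on the mat", "le livre est sur le tapis"]
kept = filter_by_language(crawled, clf, "fr")
print(kept)
```

Character n-grams are a common choice here because they need no tokenizer and cope reasonably well with the inconsistent spelling you get in colloquial text, but a toy model like this will misfire on short or mixed-language lines.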


That's really interesting. I wanted to keep things simple: get the data (scrape Twitter, for example), then use Hugging Face's AutoML or something similar. I'm not sure if this is even possible, but this was my initial "pipeline".
Do you know of any further links or resources in this direction? This is awesome.