

by ktrnka 1273 days ago

I'd suggest starting by building a high-quality data set with text from a variety of domains and publishing that. Maybe even develop some related tech, like adding the dialect to language ID packages. Another key thing might be to build a nicely curated word list for the dialect and make sure there's good documentation for researchers wanting to work in the language.
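As a rough illustration of the word list idea, here is a minimal sketch of a first-pass candidate list pulled from a multi-domain corpus. The function name `build_word_list`, the `min_domains` threshold, and the tokenizing regex are all assumptions made for this example, and the output would still need manual curation by someone who knows the dialect.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of Unicode letters, allowing internal
# apostrophes or hyphens. A real dialect may need different rules.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def build_word_list(texts_by_domain, min_domains=2):
    """Return words seen in at least `min_domains` domains, most frequent first."""
    counts = Counter()
    domains_seen = {}
    for domain, texts in texts_by_domain.items():
        for text in texts:
            for word in WORD_RE.findall(text.lower()):
                counts[word] += 1
                domains_seen.setdefault(word, set()).add(domain)
    # Requiring several domains filters out words tied to a single source.
    candidates = [w for w in counts if len(domains_seen[w]) >= min_domains]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(candidates, key=lambda w: (-counts[w], w))

if __name__ == "__main__":
    corpus = {
        "news": ["The cat sat."],
        "forum": ["the dog and the cat"],
    }
    print(build_word_list(corpus))
```

Filtering on the number of domains is only one possible heuristic; it is a cheap way to drop names and jargon that show up in a single source before a human reviews the list.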

Partly I'm feeling inspired by Google's machine translation paper about scaling to the next hundred or thousand languages. Some links in here https://ai.googleblog.com/2023/01/google-research-2022-beyon...

Also, when efforts like this have succeeded, they've been the work of many different researchers, and they usually start with data.

Training a language model on top of it is definitely doable even for individuals; you just might not be able to train on a huge data set, or you might hit a wall in terms of the perplexity you can reasonably reach.
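For anyone unfamiliar with perplexity, here is a minimal sketch of how it could be measured, using an add-alpha smoothed unigram model as a stand-in for a real language model. The function name `unigram_perplexity`, the `<unk>` token, and the smoothing choice are assumptions for illustration; a neural model would be scored the same way, by exponentiating the average negative log-probability per held-out token.

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model on held-out tokens."""
    if not test_tokens:
        raise ValueError("test_tokens must not be empty")
    counts = Counter(train_tokens)
    # Vocabulary is every training type plus one <unk> slot for unseen words.
    vocab_size = len(counts) + 1
    denom = len(train_tokens) + alpha * vocab_size
    log_prob = 0.0
    for tok in test_tokens:
        # Counter returns 0 for unseen tokens, which then share the <unk> mass.
        log_prob += math.log((counts[tok] + alpha) / denom)
    # Perplexity = exp of the average negative log-likelihood per token.
    return math.exp(-log_prob / len(test_tokens))

if __name__ == "__main__":
    train = "a a b b".split()
    print(unigram_perplexity(train, "a b".split()))
```

Lower is better: a model that hits a wall on a small data set will show held-out perplexity that stops dropping even as training continues.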


ktrnka 1273 days ago

Also, I'm happy to help any way I can. I'm not sure of the best practices for sharing contact info on HN, but if you Google "K Trnka language modeling" I should be the only one.


the_generalist 1273 days ago

That's great, I think I've added you on LinkedIn
