by omneity
89 days ago
Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].

On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [2].

Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.

0: https://wikilangs.org

1: https://omneitylabs.com

2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...
I’ve also recently started in this space: I’m building an agent for a client that can communicate in multiple languages.