by omneity 89 days ago
Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].

On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, Fineweb, or others. I worked on improving this a bit for Fineweb 2 for my native language, which might inspire you [2].

Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.

0: https://wikilangs.org

1: https://omneitylabs.com

2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...


You both might find this useful: https://news.ycombinator.com/item?id=44950661

I’ve also recently started in this space: I’m building, for a client, an agent that can communicate in multiple languages.

Excellent, thank you mandeepj! Curious about the language coverage of your agent and if / how you plan to eval your agent, if you're willing to share more.
Regarding language coverage, we will start with the most widely spoken languages.

Regarding evaluating the agent: we are still documenting the details, but this should give you some idea of the approach: https://news.ycombinator.com/item?id=47232903

Also, you might find this useful - https://open.substack.com/pub/bytebytego/p/how-roblox-uses-a...