

by omneity 89 days ago

Hey, this is super cool! I’ve been working on a similar problem, focusing on low-resource and underserved languages including the Mayan family, and have published some research and open resources around that [0, 1].

On the data side, I’ve found that the biggest bottleneck isn’t collecting text (it’s out there!) but reliable language identification. It’s often difficult or ambiguous to separate languages cleanly in datasets like Common Crawl, FineWeb, or others. I worked on improving this a bit in FineWeb 2 for my native language, which might inspire you [2].
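To make the ambiguity point concrete, here is a minimal sketch of one common way to handle it: keep a document only when the language identifier is both confident and clearly ahead of the runner-up, and set everything else aside for review. The labels (`lang_a`, `lang_b`), the thresholds, and the score dictionaries are all hypothetical; this is not how any particular dataset pipeline does it, and in practice the scores would come from whatever language ID model you use.

```python
from typing import Dict, List, Optional, Tuple


def pick_language(
    scores: Dict[str, float],
    min_conf: float = 0.8,    # hypothetical threshold, tune per language
    min_margin: float = 0.3,  # hypothetical gap required over the runner-up
) -> Optional[str]:
    """Return the top label only if it is confident and unambiguous."""
    if not scores:
        return None
    # Rank candidate languages from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_label, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Reject low-confidence predictions and near-ties between languages.
    if top_score < min_conf or top_score - runner_up < min_margin:
        return None
    return top_label


def split_by_language(
    docs: List[Tuple[str, Dict[str, float]]],
) -> Dict[str, List[str]]:
    """Group texts by accepted label; uncertain ones go to 'ambiguous'."""
    buckets: Dict[str, List[str]] = {}
    for text, scores in docs:
        label = pick_language(scores) or "ambiguous"
        buckets.setdefault(label, []).append(text)
    return buckets
```

Closely related languages tend to land in the ambiguous bucket, which is usually where the manual work (or a better, language-specific classifier) pays off.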

Many of the challenges you mention seem to recur across regions and language families, so I’d love to connect and compare notes sometime. Feel free to reach me at omar [at] the labs site below.

0: https://wikilangs.org

1: https://omneitylabs.com

2: https://huggingface.co/blog/omarkamali/gherbal-multilingual-...


mandeepj 88 days ago

You both might find this useful: https://news.ycombinator.com/item?id=44950661

I’ve also recently started in this space: I’m building an agent for a client that can communicate in multiple languages.


omneity 88 days ago

Excellent, thank you mandeepj! Curious about the language coverage of your agent and if / how you plan to eval your agent, if you're willing to share more.


mandeepj 78 days ago

Regarding language coverage, we will start with the most widely spoken languages.

Regarding evaluation: we are still documenting the details, but this should give you some idea of the approach: https://news.ycombinator.com/item?id=47232903

Also, you might find this useful - https://open.substack.com/pub/bytebytego/p/how-roblox-uses-a...
