Hacker News new | ask | show | jobs
by otreblatercero 1445 days ago
Not a single mesoamerican language is present. Maya, Náhuatl, Otomí, Zapoteco, etc. And these languages are big, they are spoken by millions and even have literature. Náhuatl and Maya are spoken in Central America.
1 comments

Are there online corpora, like Wikipedia, that could be used to train the models? Are those under a permissive enough license to be used for model training?

If there are spoken, with enough budget, a library of voices could be recorded. I think you’d prefer that collection to be gathered and maintained by a non-profit rather than Meta.

For náhuatl, I found this: Wikipedia in nahuatl https://nah.wikipedia.org/wiki/Cal%C4%ABxatl
I’m wondering if 7065 articles is enough to train the model.