Hacker News
by bertil 1445 days ago
Are there online corpora, like Wikipedia, that could be used to train the models? Are those under a permissive enough license to be used for model training?

If the languages are still spoken, then with enough budget a library of voices could be recorded. I think you’d prefer that collection to be gathered and maintained by a non-profit rather than Meta.

1 comment

For Nahuatl, I found this: the Nahuatl Wikipedia, https://nah.wikipedia.org/wiki/Cal%C4%ABxatl
I’m wondering whether 7,065 articles are enough to train a model.