by r2vcap 242 days ago

While it makes sense that LLMs and machine translation systems primarily rely on English Wikipedia as a data source, depending on smaller-language Wikipedias for training is far less ideal. English Wikipedia is generally well-regulated by its community, but many other language editions are not — so treating all of Wikipedia as an authoritative source is misguided.

For instance, my mother tongue’s Wikipedia (Korean Wikipedia) suffers from serious governance issues. The community often rejects outside contributors, and many experienced editors have already moved to alternative platforms. As a result, I sometimes get mixed, low-quality responses in my native language when using LLMs.

Ultimately, we need high-quality open data. Yet most Korean-language content is locked behind walled gardens run by chaebols like Naver and Kakao — and now they’re lobbying the government to fund their own “sovereign AI” projects. It’s a lose-lose situation.