| HN Mirror

	by Timpy 188 days ago
	A lot of training data was curated in Kenya[0]. I would imagine that if LLM data were curated in Japan, our LLMs would sound a lot like the authors of the most popular English textbooks there. Maybe common Japanese idioms like "ね" or "でしょう" would leak into the training data, and ChatGPT would say "Don't you agree?" at the end of every message. [0] https://www.theverge.com/features/23764584/ai-artificial-int...

5 comments

erikig 187 days ago

The Indian-born textbook author mentioned (Malkiat Singh [0]) had an inordinate influence on many Kenyan students because his textbooks were the de facto standard for years. It's interesting how this influence extends as his students get to curate the LLMs on which the world has come to rely.

[0] https://en.wikipedia.org/wiki/Malkiat_Singh

jojobas 187 days ago

So twists of training data procurement bring us the best of doing the needful through Africa.

m4rtink 188 days ago

You are completely right dajou~ ^_^ !

delis-thumbs-7e 187 days ago

Maybe we all should start writing Japanglish to show our authenticity? Or rather, ”Maybe we all should start writing the Japanglish, so that peoples can feel our real soul, you know?”

bakugo 187 days ago

I guess it can't be helped.

koakuma-chan 187 days ago

It's not because I like you or anything.

bpodgursky 187 days ago

This is a wild misunderstanding of LLMs. Data labeling has nothing to do with generating the astronomical text corpus used to train modern LLMs.

heavyset_go 187 days ago

The HF part of RLHF, which refines the output of LLMs, also happens in these places.