Hacker News

by bhuztez 550 days ago

Being a fanboy of the Universal (统一) Token (文字), I think Chinese is the easiest language to work with. Since Chinese has no characters, it just has a few thousand tokens. Unicode code points are a good starting point for Chinese.
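As a rough sketch of that starting point, here is what "one code point, one token" looks like in Python. The function name `codepoint_tokens` is made up for illustration, and a real vocabulary would likely remap these numbers to a smaller ID range; this only shows the idea.

```python
def codepoint_tokens(text: str) -> list[int]:
    """Treat every Unicode code point in the text as one token ID."""
    return [ord(ch) for ch in text]


def detokenize(tokens: list[int]) -> str:
    """Turn code point token IDs back into text."""
    return "".join(chr(t) for t in tokens)


tokens = codepoint_tokens("统一文字")
print(len(tokens))                      # 4 tokens, one per written symbol
print([f"U+{t:04X}" for t in tokens])   # the code points used as token IDs
print(detokenize(tokens))               # round-trips to the original text
```

No segmentation step is needed here: the text already arrives split into tokens.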

What about English? Just as there is no natural boundary between tokens in English, there is no natural boundary between words in Chinese. Before LLMs became popular, people had invented many ways to do Chinese word segmentation, just like people nowadays are inventing many ways to do tokenization.

However, in the past, most of the time you would end up with n-grams. If we learn from that history, n-grams should be a good starting point for English. For example, the word "token" should be 3 tokens: "tok", "oke", "ken". Once you add Chinese, everything should be just fine.
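The "token" example is plain overlapping character trigrams. A minimal sketch, assuming a sliding window of width 3 with step 1 (the name `char_ngrams` and the handling of words shorter than `n` are my own choices, not something stated in the comment):

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Split a word into overlapping character n-grams."""
    if len(word) < n:
        # Assumption: a word too short for a full window stays as one token.
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("token"))  # ['tok', 'oke', 'ken']
```

A word of length k yields k - n + 1 tokens this way, so longer words cost proportionally more tokens than they would under a learned subword vocabulary.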

To be more controversial, I would say there is no such language as Chinese. It is a group of languages that adopted the Universal Token. Now it is time for English to jump on the bandwagon.