Forgive me if this is a naive assumption, but wouldn’t large language models be fundamentally different for a language that is largely symbols?
Again, my understanding of Mandarin is limited if it exists at all.
This is why misspellings and homophones are tells of human righting. LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.
> LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.
Is this an elaborate joke? Your full-word misspelling of "writing" as "righting" both agrees with your statement (it is a word substitution) and contradicts it (the similarity is in pronunciation, not meaning).
"飞机" and "airplane" aren't fundamentally different in terms of how they're represented to a computer. Especially for an LLM, where tokenization likely turns each of those into a single token.
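To make the representation point concrete, here is a minimal Python sketch (the helper name `describe` is made up for illustration). It only shows that both strings reduce to Unicode code points and UTF-8 bytes. It does not run a real tokenizer, so whether either word ends up as a single token still depends on the vocabulary of the particular model.

```python
def describe(text: str) -> dict:
    """Return the Unicode code points and UTF-8 bytes of a string."""
    return {
        # One entry per character, e.g. "U+0061" for "a".
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
        # The raw bytes a byte-level tokenizer would start from.
        "utf8_bytes": list(text.encode("utf-8")),
    }


for word in ("飞机", "airplane"):
    info = describe(word)
    print(word, info["code_points"], info["utf8_bytes"])
```

Either way the computer sees a sequence of numbers. A byte-level BPE tokenizer starts from bytes like these and merges frequent sequences into larger tokens, regardless of the script they came from.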