by spuz
932 days ago
One way to consider how an LLM "sees" text is to imagine a character-based language like Chinese, where each symbol is a syllable that can be a word on its own or part of a word. If you scramble words at the character level, you produce combinations of letters that don't match any known symbol. It would be like drawing a series of random strokes and asking someone who knows Chinese what they mean. In the example given in the paper, the word "won" is a single token. When it is scrambled as "wno", it is tokenised as "w" and "no", both of which are unrelated to the original token "won". Somehow the LLM is able to relate these two completely different tokens back to the original token "won". I think the paper is claiming this is surprising because these tokens shouldn't have any correlation with each other in the training data.
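To make the "won" vs "wno" split concrete, here is a minimal sketch of a toy tokenizer. The vocabulary and the greedy longest-match rule are my own assumptions for illustration; real LLM tokenizers typically use learned BPE merges over a far larger vocabulary, so the exact splits will differ.

```python
# Hypothetical toy vocabulary (an assumption, not taken from any real model).
VOCAB = {"won", "no", "on", "w", "n", "o"}


def tokenize(text, vocab=VOCAB):
    """Split text into vocabulary entries using greedy longest-match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it on its own so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("won"))  # the intact word matches a single vocabulary entry
print(tokenize("wno"))  # the scrambled word falls apart into smaller pieces
```

With this toy vocabulary the intact word stays one token while the scrambled form splits into "w" and "no", neither of which shares anything with the "won" entry as far as the model's input IDs are concerned.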
I can attempt to produce a Japanese example by going to town on an example from Jreibun, but note that I am far from native:
後食に罠くなるのは生里像現なので壁けることはできないが、午後の事士の校率が干がるので木っている。
As far as I'm concerned, swapping out the radicals doesn't hurt that much (normally this tolerance is a drawback, since it leads you to confuse character pairs like 候 and 侯, especially if you don't practice writing), while swapping the order of characters is a bit more annoying.
That said, a Mandarin one would be more convincing, since reordering the various markers that serve the roles of Japanese verb conjugations would be less disruptive than turning できなかった into っぎかてなた, which I did not do for that reason.
--
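(The original sentence was 食後に眠くなるのは生理現象なので避けることはできないが、午後の仕事の効率が下がるので困っている。 Roughly: "Getting sleepy after a meal is a physiological phenomenon, so it can't be avoided, but it lowers my work efficiency in the afternoon, which is a problem.")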