by xg15 616 days ago
Ah, so the model "sees" the tags as literal ASCII characters interspersed with special tokens? That would make more sense.

More or less; they’re not literally the same tokens as “a”, “b”, “c”, but I’d speculate the mapping is learned from other examples of ASCII (or just Roman letters) being repeated in obscure parts of Unicode: Gothic glyphs, bubble letters, etc. Once the model has seen enough ASCII represented as Unicode code points whose tokenizations alternate between meaningless and meaningful pieces (e.g. “~l~i~k~e~ ~t~h~i~s”), it learns to read the text regardless of what the “~” is.
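
To make that concrete, here's a rough Python sketch. It assumes the "tags" being discussed are the characters in the Unicode Tags block, which mirror printable ASCII at an offset of U+E0000. The helper names (to_tags, from_tags, interleave) are made up for illustration, and nothing here shows how any real tokenizer actually splits these characters.

```python
# Sketch only: assumes "tags" means the Unicode Tags block (U+E0020..U+E007E),
# which mirrors printable ASCII at a fixed offset. Helper names are hypothetical.
TAG_BASE = 0xE0000


def to_tags(text: str) -> str:
    """Map printable ASCII characters to their tag-block counterparts."""
    return "".join(
        chr(TAG_BASE + ord(c)) if 0x20 <= ord(c) <= 0x7E else c for c in text
    )


def from_tags(text: str) -> str:
    """Map tag-block characters back to plain ASCII."""
    return "".join(
        chr(ord(c) - TAG_BASE) if 0xE0020 <= ord(c) <= 0xE007E else c for c in text
    )


def interleave(text: str, filler: str = "~") -> str:
    """Put a filler before every character, like the "~l~i~k~e~" example."""
    return "".join(filler + c for c in text)


hidden = to_tags("like this")
# In UTF-8, every character in this range starts with the same two bytes
# (f3 a0); only the last two bytes vary with the underlying ASCII letter.
print(hidden.encode("utf-8").hex(" "))
print(interleave("like this"))
print(from_tags(hidden))
```

At the byte level, each tag character is a constant prefix followed by a part that tracks the ASCII letter, which is at least consistent with the "meaningless, meaningful, meaningless, meaningful" pattern described above. Whether a given model's tokenizer exposes it that way is a separate question.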