Hacker News new | ask | show | jobs
by blochist 261 days ago
Importantly, though, the token is "\\xadder" which looks a bit like an escaped hex code. That actually suggests a different origin of the token. `\xad` is the Unicode soft-hyphen (U+00AD). The soft hyphen is used to suggest where it makes sense to hyphenate a word if a line-break is needed. This shows up fairly frequently (2.9k occurrences) on GitHub in web-scraping datasets, which suggests that a model trained on data scraped from the web might see a fair number of these.

Basically, OpenAI is trained on web data that has a number of words where splitting on -der makes sense (e.g., mur-der, un-derstanding, won-derful; although the most common occurrence in a GitHub search for "\\xadder" is what appears to be an incorrectly encoded string "L\xc3\xadder", probably from the Portuguese and Spanish "Lí-der").

Anyways, using the o200k tokenizer `mur\xadder` yields two tokens (88762 and 179582). 88762 encodes "mur" and 179582 encodes "\xadder".