
The tokenizer, as far as I know, is just byte-pair encoding. You take your whole corpus, find the most common pair of adjacent bytes (in English text, probably something like e[space] or [space]t for the first iteration), and assign it a new token. Then you do it again, with the previously created tokens now allowed as either half of a pair. Do it enough times and eventually you get full words as tokens if they're common enough, and for less common words just the root of the word (which can later be assembled as root+ing, for example, with ing being just a normal token among others).
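
To make the merge loop concrete, here is a minimal sketch in Python. It's a toy, not how any production tokenizer is actually implemented: the name `train_bpe` and its parameters are made up for illustration, it recounts every pair on each pass, and it breaks ties by whichever pair shows up first.

```python
from collections import Counter


def train_bpe(corpus: bytes, num_merges: int):
    """Toy BPE trainer (hypothetical helper, unoptimized).

    Returns the learned merges in order, plus the corpus re-expressed
    as a list of tokens after all merges have been applied.
    """
    # Start with one token per byte.
    seq = [bytes([b]) for b in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens in the current sequence.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so merging gains nothing
        merges.append((left, right))
        # Rewrite the sequence, replacing the pair with one merged token.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq


merges, tokens = train_bpe(b"aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```

Later merges can use earlier merged tokens as one half of a pair, which is how longer and longer chunks (and eventually whole words) end up as single tokens.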

It struggles with rot13 because people don't generally publish large corpora of rot13 text next to the plaintext, so the problem compounds. On one hand, the tokenizer probably recognizes very few rot13'd words, and on the other hand, even if it did, the model wouldn't have been trained to predict the correct translation after those tokens, because there are very few rot13 Rosetta stones just lying around.
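
The tokenizer half of that argument can be illustrated with a rough sketch: learn merges on a tiny English-only corpus, then tokenize the same sentence in plain form and in rot13. Everything here is an assumption for illustration (the helper names `merge_pair`, `learn_merges`, `tokenize`, the toy corpus, and the choice of 30 merges), and a compact toy trainer is included so the snippet runs by itself. Real vocabularies are far larger, but the effect should point the same way.

```python
import codecs
from collections import Counter


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(data: bytes, num_merges: int):
    """Toy BPE training: repeatedly merge the most common adjacent pair."""
    seq = [bytes([b]) for b in data]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append(pair)
        seq = merge_pair(seq, pair)
    return merges


def tokenize(data: bytes, merges):
    """Apply the learned merges, in training order, to new text."""
    seq = [bytes([b]) for b in data]
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


# Made-up English-only training corpus: no rot13 text anywhere in it.
corpus = "the cat sat on the mat and the dog sat on the log " * 20
merges = learn_merges(corpus.encode("ascii"), num_merges=30)

sentence = "the cat sat on the mat"
scrambled = codecs.encode(sentence, "rot_13")  # stdlib rot13 text transform

plain_tokens = tokenize(sentence.encode("ascii"), merges)
rot13_tokens = tokenize(scrambled.encode("ascii"), merges)
print(len(plain_tokens), plain_tokens)
print(len(rot13_tokens), rot13_tokens)
```

The plain sentence collapses into a handful of multi-byte tokens, while the rot13 version stays close to one token per byte because almost none of the learned pairs occur in it. That only covers the first half of the problem; the lack of paired rot13/plaintext training data is a separate issue this sketch doesn't touch.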