| HN Mirror

This is not entirely correct, from what I understand. The source of the "anomalous token phenomenon" is that those texts were included while training the tokenizer but not included when training the models. It is not clear they would necessarily induce the same effect otherwise (i.e., if both the tokenizer and LLMs were trained with those "counting" texts).

EDIT: notice that the "tokens" that trigger the "glitch" are not the numbers themselves but the usernames of the people counting on that subreddit (which appear nowhere in the training dataset, due to a cleaning step that removed the "counting" texts)