by cubie 986 days ago
By "irrespective of their relevance to the language modeling task", the authors mean that the semantic meaning of the tokens is not important. These 4 tokens can be completely replaced by newlines (i.e. tokens with no semantic meaning), and the perplexity as measured on a book of 65k tokens is nearly unaffected.

The key point is that these tokens are only used to "offload" attention scores; their semantic meaning is irrelevant.
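
To make the "offloading" idea concrete, here is a rough sketch in plain Python. It is not code from the paper; the logit values and the names `content_scores` and `sink_score` are made up for illustration. It only shows the mechanism: softmax forces the attention weights to sum to 1, so when no key is really relevant, the mass has to land somewhere, and a token with a large logit can soak it up regardless of what it means.

```python
import math

def softmax(scores):
    # Numerically stable softmax: the outputs always sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query over five content tokens,
# none of which is particularly relevant (all values are assumptions).
content_scores = [0.1, -0.2, 0.0, 0.15, -0.1]

# Assumed large logit for a sink token. Its content could be a newline;
# only the size of the logit matters here.
sink_score = 4.0

without_sink = softmax(content_scores)
with_sink = softmax([sink_score] + content_scores)

sink_mass = with_sink[0]
content_mass = sum(with_sink[1:])

print("weights without sink:", [round(w, 3) for w in without_sink])
print("mass on sink token:  ", round(sink_mass, 3))
print("mass on content:     ", round(content_mass, 3))
```

Without the sink, the weights are forced to spread over tokens the query does not care about. With it, most of the mass goes to the sink and the content tokens are barely attended to, which is why swapping the sink's content for something meaningless changes so little.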