by cubie
986 days ago
By "irrespective of their relevance to the language modeling task", the authors mean that the semantic content of these tokens does not matter. The 4 tokens can be replaced entirely by newlines (i.e. tokens with no semantic meaning), and the perplexity measured on a book of 65k tokens is nearly unaffected. The key point is that these tokens only serve to "offload" attention scores, so what they actually say is irrelevant.
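
To make the "offloading" idea concrete, here is a toy sketch in plain Python. It assumes attention weights come from a softmax over raw scores, so they always sum to 1 and the mass has to land somewhere. The logit values are invented for illustration (they are not taken from the paper or from a real model), and `sink_logits` / `recent_logits` are hypothetical names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Assumption: made-up logits for the 4 initial (sink) positions and
# for 4 recent tokens that are only weakly relevant to the current query.
sink_logits = [4.0, 4.0, 4.0, 4.0]
recent_logits = [0.5, 0.2, 0.1, 0.3]

weights = softmax(sink_logits + recent_logits)
sink_mass = sum(weights[:4])

print(f"attention mass on the 4 sink positions: {sink_mass:.3f}")
print(f"attention mass on the recent tokens:    {1 - sink_mass:.3f}")
```

Nothing in this computation looks at what the sink tokens are, only at the scores their positions receive. That matches the observation above: swapping them for newlines leaves the "dump the leftover attention here" role intact.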