by DavidSJ
972 days ago
A uniform distribution over 30528 tokens carries log2(30528) ≈ 14.9 bits of information per token, whereas a vocabulary of ~60000 tokens would carry about 15.9 bits per token. In practice the distribution isn't uniform, but this shows that the two are in the same ballpark.
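For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the uniform case only, where the information per token is simply log2 of the vocabulary size; the function name `uniform_bits_per_token` is made up for illustration, and 60000 stands in for the approximate "~60000" figure.

```python
import math


def uniform_bits_per_token(vocab_size: int) -> float:
    """Bits per token under a uniform distribution over vocab_size tokens."""
    return math.log2(vocab_size)


# Vocabulary sizes from the comment; 60000 is an approximation of "~60000".
small = uniform_bits_per_token(30528)
large = uniform_bits_per_token(60000)

print(f"30528 tokens: {small:.2f} bits/token")
print(f"60000 tokens: {large:.2f} bits/token")
print(f"difference:   {large - small:.2f} bits/token")
```

Doubling the vocabulary adds exactly one bit in the uniform case, so the gap between the two stays just under one bit per token.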