by flimflamm
545 days ago
To create a patch, a small model is used to predict the likelihood of the next character in the input string. Take the input string 'Lazy dog jumped over a fence.' At each position, the model gives a likelihood for every possible next character. For example: 100% sure the next character is 'a'.
Or maybe it's 10% sure it's 'a', 10% sure it's 'b', and so on.
Then we chunk character estimates together.
How many characters?
Enough characters so that the total uncertainty (entropy) in each chunk is about the same.
And there you have your 'patch' (or 'token').
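A rough sketch of the chunking as described in this comment, in case it helps. Everything here is an assumption for illustration: the function names, the `budget` parameter, and the per-character entropies (which would really come from the small model) are all made up.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-character distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def chunk_by_entropy_budget(text, entropies, budget):
    """Close a chunk once its accumulated entropy reaches `budget` (hypothetical knob)."""
    chunks, current, total = [], "", 0.0
    for ch, h in zip(text, entropies):
        current += ch
        total += h
        if total >= budget:
            chunks.append(current)
            current, total = "", 0.0
    if current:  # leftover characters that never filled the budget
        chunks.append(current)
    return chunks

# Made-up entropies: pretend every character is equally uncertain (1 bit each).
print(chunk_by_entropy_budget("abcdef", [1.0] * 6, budget=2.0))
```

Being 100% sure of the next character corresponds to `entropy_bits([1.0])`, which is zero, so predictable stretches add almost nothing to the running total and end up in longer chunks.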
|
That's not how it's described in Section 2.3 of the paper. They only use the entropy of the next byte and whether it exceeds a threshold (Global Constraint) or is larger than the preceding byte's entropy by another threshold (Approx. Monotonic Constraint).
That does mean that long repetitive sequences can result in pathologically long patches, as demonstrated in Appendix E.
But what I'm really curious about is the "small CNN byte-level model with 2-byte context" in Figure 3 (f), because it's never mentioned in any other part of the paper.
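For reference, the two boundary rules from Section 2.3 mentioned above can be sketched like this. This is only my reading of the description, not the paper's code: the function names, the threshold names `theta_g` and `theta_r`, and the entropy values are illustrative assumptions, and I treat the first byte as always starting a patch.

```python
def patch_starts_global(entropies, theta_g):
    """Global constraint: byte t starts a patch when its entropy exceeds theta_g."""
    return [t for t, h in enumerate(entropies) if t == 0 or h > theta_g]

def patch_starts_monotonic(entropies, theta_r):
    """Approx. monotonic constraint: byte t starts a patch when its entropy
    exceeds the preceding byte's entropy by more than theta_r."""
    return [t for t, h in enumerate(entropies)
            if t == 0 or h - entropies[t - 1] > theta_r]

def split_at(data, starts):
    """Cut `data` into patches beginning at the given start indices."""
    bounds = list(starts) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]

# Made-up next-byte entropies for an 8-byte input.
entropies = [3.0, 0.2, 0.1, 2.5, 0.3, 0.2, 0.1, 0.1]
print(split_at(b"abcdefgh", patch_starts_global(entropies, theta_g=2.0)))
print(split_at(b"abcdefgh", patch_starts_monotonic(entropies, theta_r=1.0)))

# A repetitive input: entropy stays low, so no boundary ever fires.
print(split_at(b"aaaaaaaa", patch_starts_global([0.1] * 8, theta_g=2.0)))
```

The last call shows the long-patch behaviour: when every byte is easy to predict, neither rule triggers and the whole run becomes a single patch.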