by yorwba
547 days ago
> How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same.

That's not how it's described in Section 2.3 of the paper. They only use the entropy of the next byte and whether it exceeds a threshold (Global Constraint) or is larger than the preceding byte's entropy by another threshold (Approx. Monotonic Constraint).

That does mean that long repetitive sequences can result in pathologically long patches, as demonstrated in Appendix E.

But what I'm really curious about is the "small CNN byte-level model with 2-byte context" in Figure 3 (f), because it's never mentioned in any other part of the paper.
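To make the two rules concrete, here is a minimal sketch of how they could be read, assuming the per-byte next-byte entropies have already been computed by some small byte-level model. The function names and the threshold parameters `theta_g` and `theta_r` are hypothetical, chosen for illustration; this is not the paper's code.

```python
def patch_starts_global(entropies, theta_g):
    """Global constraint: start a new patch wherever the entropy of the
    next byte exceeds theta_g. Index 0 always starts a patch."""
    return [i for i, h in enumerate(entropies) if i == 0 or h > theta_g]


def patch_starts_monotonic(entropies, theta_r):
    """Approx. monotonic constraint: start a new patch wherever the entropy
    rises above the preceding byte's entropy by more than theta_r."""
    return [
        i
        for i, h in enumerate(entropies)
        if i == 0 or h - entropies[i - 1] > theta_r
    ]


def split_into_patches(data, starts):
    """Cut `data` at the given start indices."""
    bounds = list(starts) + [len(data)]
    return [data[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical entropies for the 7 bytes of b"abcdefg".
entropies = [3.0, 0.5, 0.4, 2.5, 0.3, 0.2, 0.1]
starts = patch_starts_global(entropies, theta_g=2.0)
patches = split_into_patches(b"abcdefg", starts)
```

Under either rule, a long run of low and flat entropies never triggers a boundary, so the whole run ends up in one patch, which matches the pathological case mentioned above.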
Good description! Maybe what the parent got mixed up on is that an alternate way to view this is as trying to chunk bytes so that each patch carries roughly similar information. E.g., we initially tried a bunch of patching schemes, such as keeping a running total of entropy until the total exceeds a threshold, but ended up finding that simpler things worked better.
I'll see if we can add more information about the small CNN in the next update to the arXiv paper.
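For comparison, the running-total scheme mentioned above could look roughly like the sketch below. This is one possible reading, not the authors' implementation: the name `patch_starts_running_total` and the `budget` parameter are hypothetical, and the per-byte entropies are assumed to be given.

```python
def patch_starts_running_total(entropies, budget):
    """Accumulate entropy byte by byte; once the running total exceeds
    `budget`, the next byte starts a new patch and the total resets."""
    starts = []
    total = 0.0
    for i, h in enumerate(entropies):
        if i == 0 or total > budget:
            starts.append(i)
            total = 0.0
        total += h
    return starts


# With uniform entropy of 1.0 per byte and a budget of 1.5,
# every patch closes after two bytes.
uniform_starts = patch_starts_running_total([1.0] * 5, budget=1.5)
```

Unlike the per-byte threshold rules, this variant bounds the total entropy per patch, so each patch ends up carrying a roughly similar amount of information.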