Hacker News new | ask | show | jobs
by flimflamm 545 days ago
To create a patch, a small model is used to predict the likelihood for the next character in the input string. Input string: 'Lazy dog jumped over a fence.' Use the model to predict the likelihood of each character.

For example:

    100% sure the next character is 'a'.
    Or maybe it's 10% sure it's 'a', 10% sure it's 'b', and so on.
Then we chunk character estimates together. How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same. And there you have your 'patch' (or 'token').
2 comments

> How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same.

That's not how it's described in Section 2.3 of the paper. They only use the entropy of the next byte and whether it exceeds a threshold (Global Constraint) or is larger than the preceding byte's entropy by another threshold (Approx. Monotonic Constraint).

That does mean that long repetitive sequences can result in pathologically long patches, as demonstrated in Appendix E.

But what I'm really curious about is the "small CNN byte-level model with 2-byte context" in Figure 3 (f), because it's never mentioned in any other part of the paper.

(Author Here)

Good description! Maybe what parent got mixed up on is an alternate way to view this is trying to chunk bytes to have roughly similar information. EG we initially tried a bunch of patching schemes, EG, keep a running total of entropy until the total exceeds a threshold, but ended up finding simple things worked better.

I’ll see if we can add more information about the small CNN in a next update to arXiv paper.

I'm curious if you're aware of some papers from around 2005 on using contextual entropy to do unsupervised word segmentation on Chinese, and other languages that don't use spaces for word boundaries.

https://aclanthology.org/Y03-1017/ https://aclanthology.org/I05-1009/ https://aclanthology.org/P06-2056/

Exactly the same approach of segmenting a word when the entropy goes up compared to the previous byte.

It is also quite similar to Carl de Marcken's work for segmenting text and speech. He phrased everything in terms of minimum description length (MDL), but that is trivially the same thing as local entropy.

https://dspace.mit.edu/handle/1721.1/7191?show=full

At least I wasn't aware of this work, but thanks for the refs! I'm always curious to read papers from 10-20+ years ago that have similarly inspired ideas. If it makes sense, we'll mention those in the next related work update.
One way of thinking about the "Approximate Monotonic Constraint" is that you're running a quick and dirty edge detector on the entropy. Ie, you're clipping based on the gradient of per-byte entropy wrt timestep compared to detecting an edge based on gradient of per-pixel intensity wrt pixel coordinates. It would be interesting to look at the raw sequences of per-byte entropies to see how strongly these sorts of "edges" correlate with human interpretable boundaries (words, prefixes, suffixes, etc).
Figure 4 plots the entropy of each byte in "Daenerys Targeryen is in Game of Thrones, a fantasy epic by George R.R. Martin."
"That's not how it's described" - Thanks for the correction!
So a variant might be to try using a some standard compression algorithm to train with?