Hacker News new | ask | show | jobs
by antirez 536 days ago
The way this works is awesome. If I understand correctly, it's like that, given (part of) a sentence, the next token really in the sequence will be one predicted by the model among the top scoring ones, so most next tokens can be mapped to very low numbers (0 if the actual next token it's the best token in the LLM prediction, 1 if it is the second best, ...). This small numbers can be encoded very efficiently using trivial old techniques. And boom: done.

So for instance:

> In my pasta I put a lot of [cheese]

LLM top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil]

The real next token is "cheese" so I'll store "1".

Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2 And so forth.

6 comments

I'm pretty sure it doesn't use ranking. That leaves a lot of performance on the table. Instead you would use the actual predicted token probabilities and arithmetic coding.
I supposed it used arithmetic coding with the ranking bacause they have a distribution easy to exploit: zero more likely, one a bit less and so forth. What's your guess? Unfortunately Bellard is as smart as hermetic. We are here guessing what should be a README file.
The model gives you a probability distribution over the tokens. You could use that directly with arithmetic coding, but there are ways to convert that to a distribution over e.g. the next byte instead which would improve efficiency further by removing the redundancy in alternative token encodings. ts_zip does this, and README says this works similar to ts_zip.

EDIT: Hm, or maybe ts_zip uses just the token probabilities directly. I thought it was slightly more efficient about it.

"The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities."

Oh, that makes sense! So they use the probability of the next token itself. Thanks for clarifying. Also clever trick about the multiple potential tokens to represent the same text.
antirez, it's probably identical to the approach in this paper: Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression" (https://ar5iv.labs.arxiv.org/html//2402.00861).

There's a pretty straight line from assigning probabilities (to a sequence of tokens) to arithmetic compression as an optimal compression algorithm for that distribution.

If you are going to zip the resulting file, it may be useful to have a lot of 0s.

If you are going to send the result as is, Huffman coding (with some escape for unusal words(?)) will be better. I think even better than the other method that forgets the probabilities and then tries to compresd it.

Just to clarify: even storing ranking, here would likely produce good results, but not as good as storing the probability, since it exploits better the ability of arithmetic coding to store this fractional intervals. But here the fundamental trick is that the LLM can compress the "next in sequence" information in a distribution that is much better to compress than the initial data itself.
This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.
Indeed.
Arithmetic coding is better than Huffman coding because it can use a fractional number of bits per code, while Huffman has to use a whole number of bits.

IIRC the only reason it wasn't always used for everything is patents. A secondary reason being that it can be slow if you design it without thinking about performance, eg if you use divisions.

Huffman still has a performance edge for static distributions. ANS bridges some of the performance gap between arithmetic coding and huffman.
But there's no such thing as a static distribution :)
This is very similar to how many compression schemes work. Look up Huffman coding to begin with.

https://en.wikipedia.org/wiki/Huffman_coding

For anyone interested in this topic, Primeagen has a pretty great video on how he used several encoding schemes to save bandwidth in one of his projects.

https://www.youtube.com/watch?v=3f9tbqSIm-E

Seems like an ideal compression method for LoRa/Meshtastic-style communication. An LLM wouldn't run on an ESP32, but there are several that could run on a raspberry pi.

It's not just natural language that could be compressed this way, either. Code (HTML, JS, etc) could be compressed with the same technique/models. I bet that the same general idea could work for image compression as well, using an image/diffusion model (or perhaps a multimodal model for everything.)

This could lead to an entire internet of content using just a few bits.

The key insight is that the larger the shared context between parties, the more efficient communication can be, as communication tends towards a purely relational construct. The limit of this is two parties that both share the exact same context and inputs, the inputs should produce the same hidden state within both parties and communication is not even necessary because both parties have the same knowledge and state.

That's not new to anyone familiar with compression or information theory, but the novelty here is the LLM itself. It's absolutely plausible that, given an already highly compressed relationally-encoded context like a trained LLM, very few bits could be communicated to communicate very abstract and complex ideas, letting the LLM recontextualize information which has been compressed across several semantic and contextual layers, effectively leveraging a complete (but lossy) history of human knowledge against every single bit of information communicated.

LORA is also pretty slow, like the 'long fast' mode that most meshtastic users use is about a kilobit per second... and presumably a small percentage of the traffic at any time is traffic in channels that you're monitoring.

Probably decoding few tokens per second is fast enough to deliver more goodput than the existing uncompressed usage.

It'd be a fun experiment to try making it lossy.

You could adjust tokens towards what's more statistically probable, and therefore more compressible (in your example, it'd be picking tomato instead of cheese)

I could see that as a plot point in a science fiction story: Intergalactic telegrams are prohibitively expensive, so before sending one you're offered various variants of your text that amount to the same thing but save data due to using more generic (per zeitgeist) language :)

Compare also with commercial code [1], a close historical analog, albeit with handcrafted, as opposed to ML-derived, compression tables. (There was a single code point for "twins, born alive and well, one boy and one girl", for example! [2])

[1] https://en.wikipedia.org/wiki/Commercial_code_(communication...

[2] https://archive.org/details/unicodeuniversa00unkngoog/

Comes up as a minor plot point in Vernor Vinge's The Blabber (1988):

> “And from your standpoint, Hamid, there’s one big drawback. The mean bandwidth of this thing [an ansible, more or less] is just under six bits per minute.”

> “Huh? Ten seconds to send a single bit?”

> “Yup. Skandr left three protocols at the Lothlrimarre end: ASCII, a Hamming map to a subset of English, and an AI scheme that guesses what you’d say if you used more bits. The first is Skandr’s idea of a joke, and I wouldn’t trust the third more than wishful thinking.”

(Good advice at the end there.)

A bit of an aside- in one of the sequels to A Fire Upon the Deep, somebody has to interpret some very lossy audio and video into the most likely explanation, but they are stuck with a stupider than usual AI and it misinterprets the results (it's implied if they had their full AI it would have gotten the interpretation correct even with the ambiguity). This episode in the book completely changed how I think about imputation under uncertainty. Often, I don't want a single high confidence prediction, I want a probability distribution of the most likely predictions, rank ordered.
Fire has evocations, which are videos compressed down to something like just a description, then rendered at the receiving end in a way that hopefully has some resemblance to the original.

One viewer stumbles onto a key insight about the struggle taking place, but they only have evocations so they’re not sure. And they sound like a total kook so everyone ignores them.

I can't find an exact reference and I don't want to spoil too much, I think this was in Children of the Sky, at some point one of the wolf creatures is imprisoned and somebody uses the ship's reduced AI to spy on them.
I shudder to think how current LLMs would go with this. I guess we can currently do this easily for still images and audio. Kind of reminds me of Translation Party.
Yep. For lossy what could work even better is an encoder-decoder model, so that it is possible to just save the embedding, and later the embedding will be turned back into the meaning.
I've tried to build sort of model several times, but could never get it to work. The challenge is that small perturbations in encoder space lead to removing semantically important details (e.g. dates). You really want these to mess up syntax instead to get something more analogous to a lossy video encoder.
I built a lossy text compressor in the days before LLMs.

I used a word embedding to convert the text to a space where similar tokens had similar semantic meaning, then I modified an ordinary LZ encoder to choose cheaper tokens if they were 'close enough' according to some tunable loss parameter.

It "worked", but was better at producing amusing outputs than any other purpose. Perhaps you wouldn't have considered that working!

In terms of a modern implementation using an LLM, I would think that I could improve the retention of details like that by adapting the loss parameter based on the flatness of the model. E.g. for a date the model may be confident that the figures are numbers but pretty uniform among the numbers. Though I bet those details you want to preserve have a lot of the document's actual entropy.

Yep, makes sense... Something like 20 years ago I experimented with encoder/decoder models for lossy images compression and it worked very well, but it's a completely different domain indeed, where there aren't single local concentration of entropy that messes with the whole result.
I guess text in images would be similar, and is indeed where image generation models struggle to get the details right.

E.g., making a greeting card with somebody's name spelled correctly.

Seems like an opportunity to do some steganography. If the model isn't known by the attacker (or perhaps they can be "salted" to become unknown) then the actual message can be encoded in the offsets.

This would be much nicer than text-in-image steganography because services often alter images before displaying them, but they rarely do that to text (assuming the usual charset and no consecutive whitespace).

There's already research into stenography for LLM generated text for fingerprinting and identifying source: https://www.nature.com/articles/s41586-024-08025-4

The idea seems similar enough that I wanted to share. The same way you can hide information in the text to prove it was generated by a specific model and version, of course you can use this for secrets as well.

tbh I'm not sure this would qualify as steganography - the message doesn't exist at all in the encoded form. It's not hidden, it's completely gone, the information is now split into two pieces.

So it's cryptography. With a shared dictionary. Basically just ECB, though with an unbelievably large and complicated code book.

Realistically, 1) the model will be one of a small number of available models which can be tested against, 2) LLMs converge on similar predictions, so even an "unknown" model can hardly be considered cryptographic. The advantage of using an LLM this way is the ability to hide an information stream in an innocuous looking text, in the form of subtle deviations from the most likely next word. Hence, steganography.

Incidentally, this technique is actually an old one that has already seen use in magic shows and confidence tricks; two people who wish to communicate covertly, such as a magician and an "audience member", can exchange small amounts of information by varying the wording of simple responses: "yes", "that's right", "quite so". This can be used to, for instance, discreetly reveal the value of a card that the magician is "guessing" through "ESP".

I may be abusing some definition or another, but I'd say that if the primary design goal is that your cyphertext can masquerade as cleartext, "steganography" scratches the itch pretty well, if not precisely.
48298346,1,3,2,3,1,2,3 doesn't really masquerade as cleartext.

you could hide that as text in other text, and that'd be steganography.

Sorry I wasn't very complete with my description. I mean that 0,0,0,0... would correspond with the "most probable" continuation of some prompt and it would map to sensical english. And then 48298346,1,3,2... would correspond with a less probable continuation of the prompt, but it would also map to sensical english. But where more vs less probable, and the associated probabilities, are only knowable by someone with access to the secret LLM.

So you'd feed the algorithm some starter text like: "Here's my favorite recipe for brownies", and then you'd give it some data to encode, and depending on which data you gave it, you'd get a different, but "plausible", recipe for brownies. The recipient could reverse the recpie back into numbers, and from that they'd decode the hidden message.

The trick would be balancing the LLM's attempt to make sense against whatever additional constraints came along with your data encoding scheme. If you tried to encode too much cyphertext into a too-short brownies recipe, the recipe would fail to be convincingly a recipe. Conveniently, it's conventional to prefix recipes with a tremendous amount of text that nobody reads, so you've got a lot of entropy to play in.

oooh, yeah, I was thinking about it backwards. sorry about that! yeah I'd agree that's steganography.

I would definitely expect something like this to happen at some point. as long as people use LLMs with a non-zero temperature, you'd expect variation from anyone's output, so it'd be super hard to detect / super deniable with just the output.

very computationally expensive

The same goes for all the other higher-order probability models, which are used in what is currently the best known compression algorithm:

https://en.wikipedia.org/wiki/PAQ

LLMs are just another way to do the probability modeling.