by antirez 532 days ago
Yep. For lossy compression, what could work even better is an encoder-decoder model: you save just the embedding, and later the decoder turns the embedding back into the meaning.

I've tried to build this sort of model several times, but could never get it to work. The challenge is that small perturbations in encoder space end up removing semantically important details (e.g. dates). You really want them to mess up the syntax instead, to get something more analogous to a lossy video encoder.
I built a lossy text compressor in the days before LLMs.

I used a word embedding to convert the text to a space where similar tokens had similar semantic meaning, then I modified an ordinary LZ encoder to choose cheaper tokens if they were 'close enough' according to some tunable loss parameter.

It "worked", but was better at producing amusing outputs than any other purpose. Perhaps you wouldn't have considered that working!
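
A minimal sketch of that idea, not the original code: a greedy LZ77-style parse over word tokens where a back-reference is allowed whenever each token is within `loss` of the token it would be replaced by. The `EMBED` table, the two-token minimum match length, and all function names are assumptions made up for illustration; a real version would load trained word vectors and entropy-code the output.

```python
import math

# Hypothetical toy embedding; a real system would load trained word vectors.
EMBED = {
    "big": (1.0, 0.0),
    "large": (0.9, 0.1),
    "small": (-1.0, 0.0),
    "dog": (0.0, 1.0),
    "cat": (0.1, -1.0),
}

def distance(a, b):
    """Embedding distance; tokens without a vector only match themselves."""
    if a == b:
        return 0.0
    if a not in EMBED or b not in EMBED:
        return math.inf
    return math.dist(EMBED[a], EMBED[b])

def lossy_lz(tokens, loss):
    """Return a list of ('lit', token) or ('copy', offset, length) operations."""
    ops = []
    history = []  # what the decoder will have reconstructed so far
    i = 0
    while i < len(tokens):
        best_len, best_off = 0, 0
        for start in range(len(history)):
            length = 0
            # Compare against the reconstructed history so the error never
            # compounds: every output token stays within `loss` of its input.
            while (i + length < len(tokens)
                   and start + length < len(history)
                   and distance(history[start + length], tokens[i + length]) <= loss):
                length += 1
            if length > best_len:
                best_len, best_off = length, len(history) - start
        if best_len >= 2:  # assumed minimum match length
            start = len(history) - best_off
            ops.append(("copy", best_off, best_len))
            history.extend(history[start:start + best_len])
            i += best_len
        else:
            ops.append(("lit", tokens[i]))
            history.append(tokens[i])
            i += 1
    return ops

def decode(ops):
    """Replay literals and back-references to rebuild the (lossy) token list."""
    out = []
    for op in ops:
        if op[0] == "lit":
            out.append(op[1])
        else:
            _, off, length = op
            start = len(out) - off
            out.extend(out[start:start + length])
    return out
```

With `loss=0.0` this behaves like an exact token-level LZ parse. With `loss=0.2` on "the big dog saw the large dog", the second phrase becomes a back-reference to the first and decodes as "the big dog", which is the kind of amusing substitution described above.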

In terms of a modern implementation using an LLM, I would think that I could improve the retention of details like that by adapting the loss parameter based on the flatness of the model's predicted distribution. E.g. for a date the model may be confident that the figures are numbers but pretty uniform among the numbers. Though I bet those details you want to preserve have a lot of the document's actual entropy.
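
One possible reading of that, as a sketch only: shrink the allowed loss as the Shannon entropy of the model's next-token distribution grows, so positions where the model is nearly uniform (the digits of a date) are kept exact. The function names, the linear scaling, and the `h_cap_bits` cutoff are assumptions chosen for illustration, and `probs` stands in for whatever distribution a real model would return.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def adaptive_loss(probs, base_loss, h_cap_bits=3.0):
    """Scale the substitution tolerance down as the prediction gets flatter.

    At zero entropy the full base_loss is allowed; at h_cap_bits or above
    (an assumed, tunable cutoff) no substitution is allowed at all.
    """
    flatness = entropy_bits(probs) / h_cap_bits
    return base_loss * max(0.0, 1.0 - flatness)

# Model is sure a digit comes next but has no idea which one: keep it exact.
digit_probs = [0.1] * 10
# Model is nearly certain of the next token: substitutions are tolerated.
peaked_probs = [0.97, 0.01, 0.01, 0.01]

tolerance_for_digit = adaptive_loss(digit_probs, base_loss=0.2)
tolerance_for_peaked = adaptive_loss(peaked_probs, base_loss=0.2)
```

The caveat in the comment still applies: the flat positions are exactly where the bits are, so protecting them gives up much of the possible saving.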

Yep, makes sense... Something like 20 years ago I experimented with encoder/decoder models for lossy image compression and it worked very well, but it's a completely different domain indeed, where there aren't single local concentrations of entropy that mess up the whole result.
I guess text in images would be similar, and is indeed where image generation models struggle to get the details right.

E.g., making a greeting card with somebody's name spelled correctly.