by lunixbochs 1903 days ago
A few months ago I implemented a "conceptual" lossy text compression: I used word2vec-style embeddings (GloVe) to convert each word to a list of floats, laid the result out in a square, then jpeg compressed it at different quality levels. This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.
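
A minimal sketch of the embed, degrade, decode loop described above, under loud assumptions: `TOY_VECTORS` is a hypothetical hand-made table rather than real GloVe vectors, plain rounding stands in for the jpeg step, and the square image layout is skipped entirely. It only illustrates why the output is always real words and why a harsher lossy step drifts toward neighbouring words.

```python
# Hypothetical toy embedding table; a real run would load GloVe vectors instead.
TOY_VECTORS = {
    "happy": [0.9, 0.6],
    "glad": [1.0, 0.9],
    "sad": [-0.9, -0.8],
    "computer": [0.1, -0.9],
}
DIM = 2  # length of every toy vector


def encode(words):
    """Convert each word to its list of floats and stack them end to end."""
    return [x for w in words for x in TOY_VECTORS[w]]


def lossy(values, step):
    """Stand-in for jpeg (an assumption): round each float to a multiple of step."""
    return [round(v / step) * step for v in values]


def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def decode(values):
    """Snap each DIM-sized chunk to the nearest known word, so output is always real words."""
    words = []
    for i in range(0, len(values), DIM):
        chunk = values[i:i + DIM]
        words.append(min(TOY_VECTORS, key=lambda w: _sqdist(TOY_VECTORS[w], chunk)))
    return words


if __name__ == "__main__":
    text = ["happy", "sad", "computer"]
    for step in (0.1, 1.0):  # a larger step plays the role of a lower jpeg quality
        print(step, decode(lossy(encode(text), step)))
```

With this toy table, the fine step returns the input unchanged, while the coarse step turns "happy" into its neighbour "glad".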

This was a follow-up to a similar demo where a colleague compressed the raw ASCII bytes as jpeg, which started devolving into nonsense (that didn't even look like words anymore) below a quality setting of about 97.

I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.

--

Could also play with the data layout to make the jpeg curves make sense in a different way. Right now the vectors are stacked end to end, like:

  v0 = vec(words[0])
  v1 = vec(words[1])
  out = v0[0...len] + v1[0...len]

If you treat each vector as a point in an N-dimensional space, jpeg compression probably makes more sense if you lay it out like:

  out = v0[0] + v1[0] ... v0[len-1] + v1[len-1]

This would let the DCT operate on runs of the same coordinate taken from neighbouring words, instead of on unrelated dimensions of a single vector.
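
The two layouts can be sketched in a few lines of Python. The function names `stacked`, `interleaved`, and `deinterleave` are hypothetical, and the sketch assumes every vector has the same length; it shows only the reordering, not the jpeg step.

```python
def stacked(vectors):
    """Current layout: all of v0, then all of v1, and so on."""
    return [x for v in vectors for x in v]


def interleaved(vectors):
    """Proposed layout: v0[0], v1[0], ..., then v0[1], v1[1], ... (a transpose)."""
    return [x for column in zip(*vectors) for x in column]


def deinterleave(flat, count):
    """Invert `interleaved`, given how many vectors were packed."""
    return [flat[i::count] for i in range(count)]


if __name__ == "__main__":
    v0 = [1, 2, 3]
    v1 = [10, 20, 30]
    print(stacked([v0, v1]))
    print(interleaved([v0, v1]))
    print(deinterleave(interleaved([v0, v1]), 2))
```

The decoder has to know the word count (or the vector length) to undo the interleaving, so that number would need to travel with the image.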

--

Another idea, which might give even more sensible output at good compression ratios, would be to train a neural language model encoder/decoder pair on a text compression task. Teach it to "downscale" and "upscale" text, perhaps in an unsupervised or semi-supervised manner. Then you can also constrain the intermediate form.

1 comment

Here's a quick zero shot text compression attempt with GPT3, compressing then decompressing some of my sentences picked from upthread. Looks pretty impressive imo.

Notes: These results include every attempt I made and are not curated in any way, except for the "image formats" compression, where I manually asked GPT3 for additional output because it stopped short.

To construct the same request, copy the appropriate prompt and append two more lines matching the prompt, such as `{prompt}\n\nLong: your input text\nShort:` then ask GPT3 for a completion.
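
The template above amounts to simple string assembly. A minimal sketch follows, assuming a hypothetical helper named `build_request`; it only builds the text to submit and does not call any GPT3 API.

```python
def build_request(prompt, text, source_label="Long", target_label="Short"):
    """Append the two extra lines to the prompt: the labelled input, then an open label.

    The model is expected to continue the text right after the trailing label.
    """
    return f"{prompt}\n\n{source_label}: {text}\n{target_label}:"


if __name__ == "__main__":
    # Placeholder prompt; paste the full compression or decompression prompt here.
    prompt = "Long: This text is compressed.\nShort: dis txt is shrt"
    print(build_request(prompt, "I want this sentence, only shorter."))
    # For decompression, swap the labels so the request ends with "Long:".
    print(build_request(prompt, "dis txt is shrt", "Short", "Long"))
```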

--

Compression:

Prompt

  Make text shorter or longer, using slang, chat speak, or just add or remove vowels and silent letters.

  Long: This text is compressed.
  Short: dis txt is shrt

Outputs (each prediction starts after Short:)

Long: I want this sentence, only shorter.

Short: I wnt dis sentnce, only shrr

Long: This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.

Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.

Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.

Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.

--

Now I take the above outputs (without the inputs) and feed them back into GPT3 with a different prompt.

Decompression:

Prompt

  Make text longer. Expand slang and chat speak, or just add vowels, silent letters, and punctuation.

  Short: dis txt is shrt
  Long: This text is compressed.

  Short: I wnt dis sentnce, only shrr
  Long: I want this sentence, only shorter.

Outputs (each prediction starts after Long:)

Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.

Long: This enhanced the compressed output by all real words, but long the JPEG quality basically controlled the descent into synonyms and made-up words.

Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.

Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image forms as well.
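
One rough way to put numbers on a round trip like this is to compare lengths and string similarity. The sketch below uses the last sentence of the final example, copied from the thread. `size_ratio` and `similarity` are hypothetical helper names, and `difflib.SequenceMatcher` is just one convenient similarity measure, not something used to produce the results above.

```python
from difflib import SequenceMatcher

# Strings copied from the final example in this thread.
ORIGINAL = "Would be interesting to try with some other image formats as well."
SHORT = "Wd b intrestng 2 try w/ sum odder imjd frms as wl."
RESTORED = "Would be interesting to try with some other image forms as well."


def size_ratio(original, compressed):
    """Compressed length as a fraction of the original length, in characters."""
    return len(compressed) / len(original)


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print(f"size ratio: {size_ratio(ORIGINAL, SHORT):.2f}")
    print(f"round-trip similarity: {similarity(ORIGINAL, RESTORED):.3f}")
```

Character counts are a crude proxy here, since a character is not the unit the model actually works in.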