Hacker News new | ask | show | jobs
by nullc 1120 days ago
Right, so in total ignorance what I would do is just make a GPT powered text compressor and decompressor using a range coder with its probabilities at each token set by the model. Care would need to be taken with precision and end termination so a bias wouldn't be introduced (more care than typical in compressors, since no one normally cares much about 0.01% inefficiencies-- things like end termination have been addressed by bijective range coders).

Then to use it for stego just take your headerless ciphertext and decompress it. Tada: you get model output. To decode the stego just compress it. Assuming everything was built right and that your ciphertext is uniform, the output should be indistinguishable from the model just sampling using an RNG.

As a bonus, you don't even have a stego tool on your computer you just have a particularly carefully constructed text compressor and decompressor that is perfectly usable (and even state of the art in compression rate, given a big enough model) for the compression application.

1 comments

Hi, I am one of the authors on the PSSuMEC paper. Thanks a lot for your interest in our work. If I understand correctly, range coding would have the same lack of perfect security properties as arithmetic coding (cf our paper). Having said this, we are investigating the utility of iMEC in compression settings.
Ah, yeah I was trying to get a lay explanation of how arithmetic coding assuming it was done with enough precision wouldn't achieve the perfect security properties.

It might just be that the precision rapidly becomes unmanageable, because I guess you need to support the least probable symbol from every token without ever losing precision (normally arithmetic coders will re-normalize after each symbol to keep the accumulator in a reasonable precision, though they could be designed to do so as infrequently as you like at a performance cost)... If no renormalization is possible I guess an arithmetic coder accumulator would need to handle values like (1/least_prob_token)^n_tokens which gets extremely big extremely fast -- and would at the very least need an unconventional construction.