by eru
745 days ago
You seem to be making some weird assumptions? Here's how I would do this:
Use some LLM whose weights are known to both parties in the communication. Producing text with the LLM means repeatedly feeding it the text-so-far to get a probability distribution for the next token. You then use a random number generator to pick a token from that distribution.
If you want to turn this into steganography, you first take your cleartext and encrypt it with any old encryption system. The resulting bitstream should be random-looking, if your encryption ain't broken. Now you take the LLM mechanism I described above, but instead of sampling via a random number generator, you use your ciphertext as the source of entropy. (You need to use something like arithmetic coding to convert between your uniformly random-looking bitstream and the heavily weighted choices you make to sample your LLM. See https://en.wikipedia.org/wiki/Arithmetic_coding)
Almost any temperature will work, as long as it is known to both sender and receiver. (The 'temperature' parameter can be used to change the distribution, but it's still effectively a probability distribution at the end. And that's all that's required.)
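A minimal sketch of the idea above, under loud assumptions: `toy_model` is a hypothetical stand-in for a real LLM (it just hashes the context into a fixed distribution over eight words), the input bits are assumed to already be ciphertext, and the receiver is assumed to know the message length `n`. It uses exact fractions in place of a production arithmetic coder, so it only illustrates the interval-narrowing step, not an efficient or secure implementation.

```python
import hashlib
import math
from fractions import Fraction

# Hypothetical toy vocabulary; a real LLM would have tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "and", "ran"]

def toy_model(context):
    """Stand-in for an LLM: a deterministic next-token distribution for a context."""
    weights = []
    for tok in VOCAB:
        digest = hashlib.sha256((" ".join(context) + "|" + tok).encode()).digest()
        weights.append(1 + digest[0] % 8)  # pseudo-random weight in 1..8
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

def step(context, lo, hi, pick):
    """Split [lo, hi) according to the model; return the sub-interval `pick` accepts."""
    width = hi - lo
    cum = lo
    for tok, p in zip(VOCAB, toy_model(context)):
        nxt = cum + p * width
        if pick(tok, cum, nxt):
            return tok, cum, nxt
        cum = nxt
    raise ValueError("no sub-interval selected")

def embed(bits):
    """Sender: use the (ciphertext) bits, not an RNG, to choose each token."""
    n, k = len(bits), int(bits, 2)
    x = Fraction(2 * k + 1, 2 ** (n + 1))  # midpoint of the cell that encodes k
    cell_lo, cell_hi = Fraction(k, 2 ** n), Fraction(k + 1, 2 ** n)
    lo, hi, tokens = Fraction(0), Fraction(1), []
    # Keep emitting tokens until the interval pins down all n bits.
    while not (cell_lo <= lo and hi <= cell_hi):
        tok, lo, hi = step(tokens, lo, hi, lambda t, a, b: a <= x < b)
        tokens.append(tok)
    return tokens

def extract(tokens, n):
    """Receiver: replay the same model on the received tokens to recover the bits."""
    lo, hi = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        _, lo, hi = step(tokens[:i], lo, hi, lambda t, a, b, tok=tok: t == tok)
    return format(math.floor(lo * 2 ** n), "0{}b".format(n))

secret = "1011001110001111"
cover = embed(secret)
print(" ".join(cover))
print(extract(cover, len(secret)) == secret)
```

Likelier tokens get wider sub-intervals, so they are chosen more often, which is what sampling from the distribution would do as well. Applying a temperature would only change the numbers `toy_model` returns; the rest stays the same as long as both sides use the same value.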
|
That being said, yes, some of my assumptions were incorrect, mainly regarding temperature. For practical reasons I was envisioning this being implemented with a third-party LLM (e.g. OpenAI's), but I didn't realize those could have their RNG seeded as well. There is a security/convenience tradeoff to consider, however: simply setting the temperature to 0 is a lot easier to coordinate between sender and receiver than agreeing on two arbitrary numbers for temperature and seed.
I misspoke, or at least left myself open to misinterpretation, when I referred to the LLM's weights as a "secret key". I didn't mean the weights themselves had to be kept under wraps. I meant that either both parties would have to possess the weights (with the knowledge of which weights to use being the "secret"), or they'd have to use a frozen version of a third-party LLM, in which case the knowledge of which version to use would become the secret.
As for how I might take a first stab at this if I were to try implementing it myself: I might encode the message in a low base (let's say binary or ternary) and make the most likely token a 0, the second most likely a 1, and so on. To offset the risk of producing pure nonsense, I would perhaps skip positions where the gulf between the probabilities of the 1st and 2nd most likely tokens is too large.
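A rough sketch of that first stab, in binary, with everything beyond the description above being an assumption: `toy_model` is a hypothetical stand-in for a real LLM, `MAX_GAP` is an arbitrary threshold, and at a skipped position the sender simply emits the most likely token without encoding a bit.

```python
import hashlib
from fractions import Fraction

# Hypothetical toy vocabulary and threshold, chosen only for illustration.
VOCAB = ["the", "cat", "sat", "on", "a", "mat", "and", "ran"]
MAX_GAP = Fraction(1, 20)  # skip positions where p(1st) - p(2nd) exceeds this

def toy_model(context):
    """Stand-in for an LLM: a deterministic next-token distribution for a context."""
    weights = []
    for tok in VOCAB:
        digest = hashlib.sha256((" ".join(context) + "|" + tok).encode()).digest()
        weights.append(1 + digest[0] % 8)  # pseudo-random weight in 1..8
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

def top_two(context):
    """Return the two most likely tokens and whether this position can carry a bit."""
    probs = toy_model(context)
    # sorted() is stable, so ties are broken by vocabulary order on both sides.
    order = sorted(range(len(VOCAB)), key=lambda i: -probs[i])
    first, second = order[0], order[1]
    usable = probs[first] - probs[second] <= MAX_GAP
    return VOCAB[first], VOCAB[second], usable

def embed(bits):
    """Sender: most likely token encodes 0, second most likely encodes 1."""
    tokens, pending = [], list(bits)
    while pending:
        first, second, usable = top_two(tokens)
        if usable:
            tokens.append(second if pending.pop(0) == "1" else first)
        else:
            tokens.append(first)  # gap too large: emit the top token, no bit
    return tokens

def extract(tokens):
    """Receiver: replay the model and read a bit at every usable position."""
    bits = []
    for i, tok in enumerate(tokens):
        first, second, usable = top_two(tokens[:i])
        if usable:
            bits.append("1" if tok == second else "0")
    return "".join(bits)

secret = "0110100111"
cover = embed(secret)
print(" ".join(cover))
print(extract(cover) == secret)
```

Because the sender stops as soon as the last bit is placed, the receiver does not need to know the message length here. Unlike the arithmetic-coding approach, this picks the second-ranked token about half the time at usable positions regardless of its actual probability, so the output distribution differs from ordinary sampling.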