by pash 17 days ago
The obvious solution is to run things in reverse, inputting the AI-generated output to recover the prompt that generated it.

Most generative models can be run in reverse by algorithms that already exist [0], but you have to have the model weights. For closed-weight models, or for a process that can handle unknown models, you’d have to do some engineering.

But do we have the technology to build models that back out the prompt from suspected AI output? Yes.

0. I don’t mean that most neural networks are invertible functions. They’re not. But you can run backprop from the output all the way back to the input, holding the weights fixed, to optimize an input to the original model that best reproduces a given output.

1 comments

Most of the functions that LLMs perform aren’t bijective, though.

What prompt constructs the output ‘The answer is 3’ or ‘Yes that’s a great idea’?

Right, that’s why I wrote, “I don’t mean that most neural networks are invertible functions.”

For a neural network that is not bijective, you can obtain an input that maps to a desired output by the following algorithm.

1. Start with a trained neural network. (The weights will not change throughout this procedure.)

2. Pick a random input.

3. Given the target output for which you want to find an associated input, feed the current input into the network to compute its output.

4. Compute the loss of the computed output relative to the target output (e.g., mean-square error). If the loss is sufficiently small, you’ve found an input that maps to an output close to your target output and you’re done.

5. Otherwise, compute the gradient of the loss with respect to the input (e.g., by backprop).

6. Update the input according to a gradient-update rule, then go back to Step 3.
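
For concreteness, here is a minimal sketch of Steps 1 to 6 in plain Python. Everything in it is an illustrative assumption rather than anything from a real model: the tiny 2-3-2 tanh network, its hand-picked weights, and the names `forward`, `input_gradient`, and `invert` are all hypothetical. The starting input is passed in by the caller instead of being drawn at random, and the inputs are continuous vectors, so applying this to discrete token prompts would need extra machinery that is not shown.

```python
import math

# Step 1: a "trained" network with frozen weights (hypothetical values).
# y = W2 @ tanh(W1 @ x + b1) + b2, with 2 inputs, 3 hidden units, 2 outputs.
W1 = [[1.0, 0.5], [-0.5, 1.0], [0.3, -0.7]]
b1 = [0.1, -0.2, 0.0]
W2 = [[1.0, -0.5, 0.3], [0.2, 0.8, -0.6]]
b2 = [0.0, 0.1]


def forward(x):
    """Return (hidden activations, output) for input x."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(W2, b2)]
    return h, y


def mse(y, target):
    """Mean-square error between a computed output and the target output."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target)) / len(y)


def input_gradient(x, target):
    """Backprop the MSE loss to the input; the weights are never touched."""
    h, y = forward(x)
    dy = [2.0 * (yi - ti) / len(y) for yi, ti in zip(y, target)]
    # Through the output layer: dL/dh = W2^T dy.
    dh = [sum(W2[k][j] * dy[k] for k in range(len(dy))) for j in range(len(h))]
    # Through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
    dz = [dhj * (1.0 - hj * hj) for dhj, hj in zip(dh, h)]
    # Through the first layer: dL/dx = W1^T dz.
    return [sum(W1[j][i] * dz[j] for j in range(len(dz))) for i in range(len(x))]


def invert(target, x0, lr=0.1, tol=1e-8, max_steps=5000):
    """Gradient descent on the input until the output is close to target."""
    x = list(x0)                                        # Step 2: initial input
    for _ in range(max_steps):
        _, y = forward(x)                               # Step 3: forward pass
        if mse(y, target) < tol:                        # Step 4: close enough?
            break
        grad = input_gradient(x, target)                # Step 5: dLoss/dInput
        x = [xi - lr * gi for xi, gi in zip(x, grad)]   # Step 6: update input
    return x, mse(forward(x)[1], target)


# Usage: build a reachable target from a known input, then search from elsewhere.
_, target = forward([0.5, -0.3])
x_found, final_loss = invert(target, x0=[0.0, 0.0])
print(x_found, final_loss)
```

Note that the loop only promises an input whose output is close to the target. When many inputs map to nearly the same output, which one you land on depends on the starting point and the step size.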

In theory, you can recover a “representative” prompt for the output of an LLM in this manner. For outputs that could have been generated by a large set of disparate prompts, obviously this won’t work well.