Right, that’s why I wrote, “I don’t mean that most neural networks are invertible functions.”

For a neural network that is not bijective, you can obtain an input that maps to a desired output by the following algorithm.

1. Start with a trained neural network. (The weights will not change throughout this procedure.)

2. Pick a random input.

3. Given a target output for which you want to compute an associated input, feed the current input into the network to compute its output.

4. Compute the loss of the computed output relative to the target output (e.g., mean-square error). If the loss is sufficiently small, you’ve found an input that maps to an output close to your target output and you’re done.

5. Otherwise, compute the gradient of the loss with respect to the input (e.g., by backprop).

6. Update the input according to a gradient-update rule (e.g., plain gradient descent), then go back to Step 3.
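Here is a minimal sketch of the steps above, not a definitive implementation. Everything in it is an assumption made for illustration: the tiny 3 -> 4 -> 1 tanh network, its hard-coded weights (`W1`, `B1`, `W2`, `B2`), the helper names `forward` and `find_input`, the learning rate, and the tolerance. The gradient with respect to the input is written out by hand (it is just backprop through one hidden layer) so the example needs only the standard library.

```python
import math
import random

# Hypothetical "trained" weights for a tiny 3 -> 4 -> 1 tanh network.
# They are never modified (Step 1). Three inputs map to one output,
# so the network cannot be bijective.
W1 = [[0.5, -0.3, 0.8],
      [-0.6, 0.9, 0.1],
      [0.2, 0.4, -0.7],
      [0.7, -0.5, 0.3]]
B1 = [0.1, -0.2, 0.05, 0.0]
W2 = [0.6, -0.4, 0.9, 0.5]
B2 = 0.1


def forward(x):
    """Return (hidden activations, scalar output) for input vector x."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, B1)]
    y = sum(w * hi for w, hi in zip(W2, h)) + B2
    return h, y


def find_input(target, lr=0.1, tol=1e-10, max_steps=10_000, seed=0):
    """Gradient descent on the input (not the weights) toward `target`."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]       # Step 2: random input
    for _ in range(max_steps):
        h, y = forward(x)                                # Step 3: forward pass
        loss = (y - target) ** 2                         # Step 4: squared error
        if loss < tol:
            break
        # Step 5: backprop the loss to the input.
        dloss_dy = 2.0 * (y - target)
        # d tanh(z)/dz = 1 - tanh(z)^2, and h already holds tanh(z).
        dloss_dz = [dloss_dy * w * (1.0 - hi * hi) for w, hi in zip(W2, h)]
        grad = [sum(dz * row[i] for dz, row in zip(dloss_dz, W1))
                for i in range(len(x))]
        # Step 6: gradient-descent update of the input.
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    for seed in (0, 1):
        x = find_input(0.5, seed=seed)
        print(seed, x, forward(x)[1])
```

Running it with two different seeds should give two different inputs whose outputs both land near the target, which is the non-bijective case in miniature: the procedure finds an input for the output, not the input.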

In theory, you can recover a “representative” prompt for the output of an LLM in this manner. For outputs that could have been generated by a large set of disparate prompts, this obviously won’t work well: at best you recover one of the candidates, with no guarantee that it is the one actually used.