by Eridrus 457 days ago
I applaud you for trying to tackle this, but after reading your docs a little I am skeptical of your approach.

Your example of a rule saying that the user's email address must not be put in a search query seems to have two problems: a) non-LLM checkers can be bypassed by telling the LLM to encode the email tokens, trivially with ROT13 or any of many other encoding schemes, and b) the LLM checkers suffer from the same prompt injection problems.
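
To make (a) concrete, here is a minimal sketch. It assumes a hypothetical non-LLM checker that does a literal match on the address; `naive_email_check` and the example address are made up for illustration and are not taken from your toolkit.

```python
import codecs


def naive_email_check(query: str, user_email: str) -> bool:
    """Hypothetical non-LLM rule: flag the query if it contains the address verbatim."""
    return user_email.lower() in query.lower()


user_email = "alice@example.com"  # made-up address for illustration

# What the rule is meant to catch: the address sent as-is.
plain_query = f"find accounts for {user_email}"

# What an injected instruction could ask the LLM to emit instead.
encoded_email = codecs.encode(user_email, "rot_13")
encoded_query = f"find accounts for {encoded_email}"

print(naive_email_check(plain_query, user_email))    # flagged
print(naive_email_check(encoded_query, user_email))  # not flagged

# The receiving side recovers the address with one call.
print(codecs.decode(encoded_email, "rot_13") == user_email)
```

The checker could be taught to undo ROT13 too, but then the attacker switches to base64, a custom substitution, or splitting the address across several queries, which is why I don't think pattern matching gets you very far here.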

In particular, gradient-based methods are unsurprisingly a lot better at defeating all the proposed mitigations, e.g. https://arxiv.org/abs/2403.04957

For now I think the solutions are going to have to be even less general than your toolkit here.