| HN Mirror

I'll play advocatus diaboli for once here.

Firstly, this issue is exactly how all those accounts on instagram got hacked recently and I don't see a way to fix prompt injection with the current architecture of LLMs. I strongly suspect it is entirely impossible to achieve.

But, that doesn't mean that all useful actions are forbidden. The important part is identifying maximum and minimum harms. I lean towards LLMs for simple NLP tasks like detecting obvious spam, because even if it is completely wrong the worst case is that a spam message gets through or a valid one gets sent to spam - two issues we already routinely deal with anyway.