by Der_Einzige 1292 days ago
This and related techniques are trivially foolable by fine-tuning the model.

They're also trivially foolable by sampling techniques or settings that push the model toward generating rare words much more often.
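
To make the sampling point concrete, here's a minimal sketch of one such setting, temperature. The vocabulary and logits are made up for illustration, and temperature is only one knob among several that shift probability mass toward rare tokens.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary and logits (not from any real model):
# "the" is common, "zephyr" is rare.
vocab = ["the", "a", "zephyr"]
logits = [5.0, 4.0, 0.0]

p_default = softmax_with_temperature(logits, 1.0)
p_hot = softmax_with_temperature(logits, 3.0)

rare = vocab.index("zephyr")
print(f"P(zephyr) at T=1.0: {p_default[rare]:.4f}")
print(f"P(zephyr) at T=3.0: {p_hot[rare]:.4f}")
```

The rare token gets a much larger share of the mass at the higher temperature, so anything that relies on the text following the model's default token statistics sees a different distribution.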

Also foolable with filter-assisted decoding: https://paperswithcode.com/paper/most-language-models-can-be...
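
Rough sketch of what I mean by filter-assisted decoding, assuming it boils down to dropping every token that fails some lexical filter before sampling. The toy vocabulary, the logits, and the "no letter e" filter are all hypothetical, and this isn't necessarily how the linked paper implements it.

```python
import math
import random

def filtered_sample(vocab, logits, keep, rng):
    """Sample one token after discarding every token the filter rejects."""
    allowed = [(tok, lg) for tok, lg in zip(vocab, logits) if keep(tok)]
    if not allowed:
        raise ValueError("filter removed every token")
    m = max(lg for _, lg in allowed)  # subtract the max for numerical stability
    weights = [math.exp(lg - m) for _, lg in allowed]
    tokens = [tok for tok, _ in allowed]
    # random.choices normalizes the weights itself
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical toy vocabulary and logits (not from any real model).
vocab = ["the", "model", "output", "is", "robust", "stochastic"]
logits = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5]

def no_letter_e(tok):
    """Example filter: ban any token containing the letter 'e'."""
    return "e" not in tok

rng = random.Random(0)  # seeded so runs are repeatable
samples = [filtered_sample(vocab, logits, no_letter_e, rng) for _ in range(20)]
print(samples)
```

The two most likely tokens ("the", "model") never show up, so the output is forced onto words the model would normally pick less often.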