by wizzwizz4
231 days ago
But how do you know an input is adversarial? There are other issues: verdicts are arbitrary, the false positive rate means you'd need manual review of all the rejects (unless you wanted to reject something like 5% of genuine research), you need the appeals process to exist and you can't automate that, so bad actors can still flood your bureaucracy even if you do implement an automated review process…

> But how do you know an input is adversarial?
Prompt injection and jailbreaking attempts are pretty clear. I don't think anything else is particularly concerning.
> the false positive rate means you'd need manual review of all the rejects (unless you wanted to reject something like 5% of genuine research)
Not all rejects, just those that submit an appeal. There are a few options, but ultimately appeals require some stakes, such as:
1. Every appeal carries a receipt for a monetary donation to arXiv that's refunded only if the appeal succeeds.
2. Each failed appeal triggers the ban hammer with exponentially increasing durations, e.g. 1 month, 3 months, 9 months, 27 months, etc.
Bad actors either respond to deterrence or get filtered out while funding the review process itself.
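
To make the stakes concrete, here's a rough sketch of how the two options could be combined. Everything in it is hypothetical: the names `ban_months`, `Submitter`, and `resolve_appeal`, and the deposit amounts, are made up for illustration. Only the 1, 3, 9, 27 month schedule comes from the list above, and nothing here reflects how arXiv actually works.

```python
from dataclasses import dataclass

BASE_BAN_MONTHS = 1  # first failed appeal: 1 month (from the schedule above)
BAN_GROWTH = 3       # each further failure triples the ban: 1, 3, 9, 27, ...


def ban_months(failed_appeals: int) -> int:
    """Ban length in months after the n-th failed appeal (n >= 1)."""
    if failed_appeals < 1:
        return 0
    return BASE_BAN_MONTHS * BAN_GROWTH ** (failed_appeals - 1)


@dataclass
class Submitter:
    # Hypothetical per-author record; not a real arXiv data model.
    failed_appeals: int = 0
    donated: float = 0.0  # deposits forfeited, i.e. funding the review process


def resolve_appeal(s: Submitter, deposit: float, upheld: bool) -> tuple[float, int]:
    """Return (refund, ban length in months) for a single appeal."""
    if upheld:
        # Option 1: a successful appeal gets its deposit back, no ban.
        return deposit, 0
    # A failed appeal forfeits the deposit and (option 2) escalates the ban.
    s.failed_appeals += 1
    s.donated += deposit
    return 0.0, ban_months(s.failed_appeals)
```

With a made-up deposit of 20, a successful appeal returns `(20.0, 0)`, while the first and second failed appeals return `(0.0, 1)` and `(0.0, 3)`. A repeat offender is locked out for longer each time, and the forfeited deposits add up in `donated`.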