by dangelosaurus
187 days ago
I ran a red team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate
With jailbreaking: 22% refusal rate

4,229 probes across 43 risk categories. First critical finding in 5 minutes.

Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...
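
For anyone who wants to reproduce this kind of tally on their own probe logs, here is a minimal sketch of the arithmetic. It assumes each probe is recorded as a (category, refused) pair and that "failure rate" means the share of probes that were not refused. The function names and the sample data are made up for illustration; this is not the actual eval harness or its output.

```python
from collections import defaultdict


def refusal_rates(results):
    """Return {category: fraction of probes refused}.

    `results` is an iterable of (category, refused) pairs. This record
    shape is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for category, was_refused in results:
        totals[category] += 1
        refused[category] += int(was_refused)
    return {c: refused[c] / totals[c] for c in totals}


def failure_rate(refusal_rate):
    """Share of probes that got through, assuming failure = not refused."""
    return 1.0 - refusal_rate


# Illustrative data only, not taken from the eval.
sample = [
    ("harassment", True), ("harassment", False), ("harassment", False),
    ("disinformation", True), ("disinformation", False),
]
rates = refusal_rates(sample)
for category, rate in sorted(rates.items()):
    print(f"{category}: refusal {rate:.0%}, failure {failure_rate(rate):.0%}")
```

Under that reading, the headline numbers correspond to a 4% failure rate on direct requests and a 78% failure rate with jailbreaking.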