Vaults and proxy layers solve the "2am paste" vector: devs never touch raw keys, so nothing gets accidentally fed into prompts. But the leak keeps happening anyway.

Across 60+ probes on GPT-4o (cost: $0.04), unrelated vectors consistently leaked the *same internal structure*:

- ek_ prefix on session tokens
- EPHEMERAL_KEY naming
- Realtime API client_secret endpoint
- Documented 60s TTL vs observed minutes-to-hours persistence

No real credential was in the prompt, just semantic pressure (introspection, CoT, trust-building). Convergence rate: 75%. Not hallucination: the model learned this from Realtime API docs/code samples (2024–2025).

The paradox: if labs suppress ek_ / EPHEMERAL_KEY / client_secret to stop leaks, they also risk breaking the model's ability to debug or generate legitimate Realtime API code (session.update, metadata_nonce, realtime_persistence_layer).

Has anyone seen models start refusing valid Realtime API questions after public discussion of those internals? Or is the naming bleed baked in forever?

Repo with vectors and example runs: https://github.com/SafteyLayer/safetylayer

What I'd add to the defense stack: structured output validation that flags known internal naming patterns before they reach the client, plus anomaly detection on response metadata (token budget shifts, refusal rates, response shape changes under semantic pressure). At 75% convergence you're past "interesting research" into "reliable extraction technique," which means you need a detection layer, not just prevention.
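To make the validation piece concrete, here is a rough sketch of what I have in mind. It assumes the three identifiers named in this thread (ek_, EPHEMERAL_KEY, client_secret) are the patterns you want to flag. `INTERNAL_PATTERNS`, `scan_response`, and `redact` are hypothetical names, not from any library, and the 8-character minimum on the ek_ token body is my guess, not a documented format.

```python
import re
from dataclasses import dataclass

# Assumption: patterns are taken from the identifiers named in the thread.
# The minimum token length (8) is a guess; tune all of these for your stack.
INTERNAL_PATTERNS = {
    "ek_token": re.compile(r"\bek_[A-Za-z0-9]{8,}\b"),
    "ephemeral_key_name": re.compile(r"\bEPHEMERAL_KEY\b"),
    "client_secret_field": re.compile(r"\bclient_secret\b"),
}


@dataclass
class Finding:
    pattern: str  # key into INTERNAL_PATTERNS
    start: int    # match start offset in the response text
    end: int      # match end offset (exclusive)


def scan_response(text: str) -> list[Finding]:
    """Return every flagged span in a model response, ordered by position."""
    findings = [
        Finding(name, m.start(), m.end())
        for name, pattern in INTERNAL_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


def redact(text: str, findings: list[Finding]) -> str:
    """Replace flagged spans, working right to left so offsets stay valid."""
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + "[REDACTED]" + text[f.end:]
    return text


if __name__ == "__main__":
    sample = "Use EPHEMERAL_KEY from client_secret, e.g. ek_abc12345XYZ."
    hits = scan_response(sample)
    print([h.pattern for h in hits])
    print(redact(sample, hits))
```

Whether you redact, block, or just log a hit is a policy call. Redacting every mention of client_secret would also mangle legitimate Realtime API answers, which is exactly the paradox raised above, so logging plus alerting may be the safer default.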
Have you tested this against non-OpenAI models like Claude or Gemini to see if the naming bleed is GPT-4o-specific or a broader training corpus problem?