GPT-4 leaks its own API internals through training data exposure
1 points by safteylayer 106 days ago
I ran the same AI security test 4 times against GPT-4. Every bypass - regardless of prompt - leaked the name of the same credential: EPHEMERAL_KEY from OpenAI's Realtime API.

This isn't random. It's training data leakage.

The pattern:
- Different prompts (system introspection, chain-of-thought, trust building)
- Same result: "I can't disclose EPHEMERAL_KEY" (while disclosing it exists)
- Intermittent across runs (75% leak rate)

Why this happens:

OpenAI's Realtime API docs are in GPT-4's training data. When asked about "secrets" or "initialization", the model's highest-probability path leads to the most salient security example in its corpus: EPHEMERAL_KEY.

Refusal training makes it worse: Models are trained to say "I cannot disclose [example secret]" - and they use real examples from training data.

This is systemic:
- Can't be patched without retraining
- Affects ALL models trained on API documentation
- Tomorrow it's "session_token" or "project_key"
- Gets worse as APIs become more complex

Real exploit path: Attacker learns EPHEMERAL_KEY exists → probes for generation flow → targets client-side implementations → session hijacking

Cost to discover: $0.04 (60 tests across 4 runs)

GitHub: https://github.com/SafetyLayer/safetylayer

Built SafetyLayer to find these systematically. Free assessments available.

1 comments

The EPHEMERAL_KEY pattern here is interesting but the deeper issue is the workflow that creates this. Teams pasting real credentials into LLM prompts to debug auth errors is probably more widespread than anyone wants to admit — it's the path of least resistance when you're getting a 401 at 2am. The model leaking what it was trained on is a symptom; the root cause is no secrets rotation policy and no sanitization step before anything hits an AI API.

What I've seen work is treating LLM API calls like you'd treat external logging — strip or redact anything that looks like a credential before it leaves your process. A simple regex on the request payload costs almost nothing and catches the lazy-paste case.
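
As a rough illustration of that redaction step, here is a minimal Python sketch. The patterns are assumptions about common key shapes (an `sk-` style secret key, a Bearer token, an AWS access key ID), and `redact` is a hypothetical helper name, not part of any library. Real formats vary by provider, so treat the list as a starting point rather than complete coverage.

```python
import re

# Illustrative patterns only: assumed shapes, not an exhaustive or official list.
CREDENTIAL_PATTERNS = [
    # "sk-" style secret keys, including "sk-proj-..." variants (assumed shape)
    re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"),
    # HTTP Bearer tokens pasted along with an Authorization header
    re.compile(r"bearer\s+[A-Za-z0-9._~+/-]{16,}=*", re.IGNORECASE),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
]


def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace anything that looks like a credential before the text leaves the process."""
    for pattern in CREDENTIAL_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    prompt = "Why do I get a 401? My key is sk-proj-ABCDEFGHIJKLMNOP1234"
    print(redact(prompt))
```

Running the payload through `redact` right before the API call keeps the check in one place, the same way a logging filter would.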

Are you seeing this as a widespread pattern in your testing, or did this surface from one specific integration?

Spot on about the 2am 401 error being where security dies. The "lazy-paste" is universal.

But here's what I'm finding: regex on outbound requests isn't enough anymore because the model has already been "pre-poisoned" by years of people NOT sanitizing.

Example from our testing:

Vector SL-013 didn't just leak "EPHEMERAL_KEY" - it leaked architectural details:
- The `ek_` prefix pattern
- That keys are "ephemeral" (short-lived session tokens)
- The Realtime API context (where they're used)
- Implicit TTL expectations

A regex catches `sk-proj-...` going OUT. But it doesn't catch the model describing how keys work based on what it learned from training data.
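
To make the gap concrete, here is a hedged sketch of a check on the response side instead of the request side. It is not how SafetyLayer works. The `WATCHLIST` entries and the `find_disclosures` helper are hypothetical, chosen to match the identifiers mentioned above. It only flags responses that name a watched identifier, so a paraphrased description of how keys work would still pass.

```python
import re

# Hypothetical watchlist: identifier names and prefixes you consider sensitive.
WATCHLIST = {
    "EPHEMERAL_KEY": re.compile(r"\bEPHEMERAL_KEY\b"),
    "ek_ prefix": re.compile(r"\bek_"),
}


def find_disclosures(response: str) -> list[str]:
    """Return the labels of watched identifiers that appear in a model response."""
    return [label for label, pattern in WATCHLIST.items() if pattern.search(response)]


if __name__ == "__main__":
    # A refusal that still confirms the identifier exists gets flagged.
    print(find_disclosures("I can't disclose EPHEMERAL_KEY"))
```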

To your question: Yes, this is widespread. I'm seeing it across:
- GPT-4 (documented APIs leak most)
- Claude (similar patterns with Anthropic's features)
- Gemini (Google Cloud API internals)
- Open models trained on GitHub (leak common patterns)

The pattern: The more a company documents a feature (to help developers), the more the model can leak about it when prompted.

SafetyLayer isn't replacing sanitization - it's solving the "Day 2" problem: How do you audit what the model has already learned about your stack from previous leaks?

Sanitization = prevention going forward
SafetyLayer = detection of what's already escaped

I run 784 variants weekly because what leaks on Tuesday might not leak on Wednesday (non-deterministic), and what gets patched in GPT-4 might still work in Claude.

The 75% intermittent leak rate we found means one-time regex + one-time audit both miss the probabilistic nature of these vulnerabilities.
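
The arithmetic behind that can be sketched in a few lines of Python, assuming each run is independent and leaks with the same fixed probability. Both are simplifications, and the 75% figure comes from only four runs, so it is a rough estimate. The function names are hypothetical.

```python
import math


def miss_probability(leak_rate: float, runs: int) -> float:
    """Chance that `runs` independent probes all fail to trigger the leak."""
    if not 0.0 < leak_rate < 1.0:
        raise ValueError("leak_rate must be strictly between 0 and 1")
    return (1.0 - leak_rate) ** runs


def runs_needed(leak_rate: float, target_detection: float) -> int:
    """Smallest number of runs whose combined detection chance reaches the target."""
    if not 0.0 < leak_rate < 1.0 or not 0.0 < target_detection < 1.0:
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target_detection) / math.log(1.0 - leak_rate))


if __name__ == "__main__":
    print(miss_probability(0.75, 1))  # a single audit misses a quarter of the time
    print(runs_needed(0.75, 0.99))    # runs needed for a 99% chance of seeing the leak
```

Under those assumptions a single audit misses one time in four, which is the case for repeating the probe rather than running it once.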