by thestructuralme
148 days ago
The “snake eating its own tail” frame is real, but it’s not mystical: it’s incentives plus sampling. If the web gets flooded with LLM output and you train on it naively, you’re effectively training on your own prior. That pushes models toward the mean: less surprise, less specificity, more template-y phrasing. It’s like photocopying a photocopy: the sharp edges disappear. The fix isn’t “never use synthetic data.” It’s to treat it like a controlled ingredient: tag provenance, keep a high-quality human/grounded core, filter aggressively, and anchor training to things that don’t self-contaminate (code that compiles and passes its tests, math with verifiable proofs, retrieval with citations, real user feedback). Otherwise the easiest path is content volume, and volume is exactly what kills signal.
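To make the "controlled ingredient" idea concrete, here is a rough sketch of what provenance tagging plus a capped, filtered synthetic share could look like. Everything in it is my own assumption for illustration: the `Sample` type, the tag names (`"human"`, `"verified"`, `"synthetic"`), the `quality` score (which would have to come from some upstream filter), and the default thresholds. It is not anyone's actual pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    text: str
    provenance: str  # hypothetical tags: "human", "verified", or "synthetic"
    quality: float   # hypothetical score in [0, 1] from an upstream filter


def build_training_mix(
    samples: List[Sample],
    max_synthetic_percent: int = 20,     # assumed cap, not a recommended value
    min_synthetic_quality: float = 0.8,  # assumed filter threshold
) -> List[Sample]:
    """Keep the whole human/verified core; admit only the best synthetic samples, capped."""
    if not 0 <= max_synthetic_percent < 100:
        raise ValueError("max_synthetic_percent must be in [0, 100)")

    # The grounded core is never dropped or diluted past the cap.
    core = [s for s in samples if s.provenance in ("human", "verified")]

    # Filter synthetic samples aggressively, then rank the survivors by quality.
    synthetic = sorted(
        (
            s
            for s in samples
            if s.provenance == "synthetic" and s.quality >= min_synthetic_quality
        ),
        key=lambda s: s.quality,
        reverse=True,
    )

    # Largest k such that k / (len(core) + k) <= max_synthetic_percent / 100,
    # computed with integers to avoid floating-point rounding.
    budget = (max_synthetic_percent * len(core)) // (100 - max_synthetic_percent)

    return core + synthetic[:budget]
```

The point of the design is that the synthetic budget is derived from the size of the grounded core, so adding more generated text never increases its share of the mix. Only growing the human/verified core does.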
|
Humans are amazing machines that reduce insane amounts of complexity through bespoke combinations of neural processors to synthesize ideas and emotions. Even Ilya Sutskever has said that he wasn't, and still isn't, clear at a formal level on why GPT works at all (i.e., the interpretability problem). But GPT was not a random discovery: it was built on work that amalgamated the careers and biases of Ilya and others.