
by thestructuralme 148 days ago

The “snake eating its own tail” frame is real, but it’s not mystical — it’s incentives + sampling.

If the web gets flooded with LLM output and you train on it naively, you’re effectively training on your own prior. That pushes models toward the mean: less surprise, less specificity, more template-y phrasing. It’s like photocopying a photocopy: the sharp edges disappear.
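
A toy illustration of the photocopy effect, assuming a one-dimensional Gaussian stands in for the "model": each generation is fit only to a small corpus sampled from the previous generation's fit. The function name, sample size, and generation count are made up for illustration, and real LLM training is far messier, so read this as a sketch of the mechanism rather than a measurement.

```python
import random
import statistics


def simulate_self_training(generations=200, samples_per_gen=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from its own previous fit.

    Returns the fitted standard deviation per generation, starting with
    the original "human" distribution (std = 1.0). Toy model, hypothetical
    parameters.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0
    history = [std]
    for _ in range(generations):
        # "Publish" a small corpus sampled from the current model.
        corpus = [rng.gauss(mean, std) for _ in range(samples_per_gen)]
        # "Train" the next model on that corpus alone (maximum-likelihood fit).
        mean = statistics.fmean(corpus)
        std = statistics.pstdev(corpus)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_self_training()
    print(f"initial std: {history[0]:.3f}")
    print(f"final std:   {history[-1]:.3g}")
```

With so few samples per generation, the fitted spread drifts toward zero: the tails get undersampled, the next fit is narrower, and the loss compounds. Raising `samples_per_gen` slows the drift in this toy setup but does not reverse it.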

The fix isn’t “never use synthetic data.” It’s to treat it like a controlled ingredient: tag provenance, keep a high-quality human/grounded core, filter aggressively, and anchor training to things that don’t self-contaminate (code that compiles/tests, math with verifiable proofs, retrieval with citations, real user feedback). Otherwise the easiest path is content volume, and volume is exactly what kills signal.
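
The "controlled ingredient" idea could be sketched roughly like this. Everything here is an assumption for illustration: the `Example` fields, the `"human"`/`"synthetic"` labels, the quality score, and the 20% cap are hypothetical, not a recipe anyone has validated.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    source: str      # provenance tag: "human" or "synthetic" (hypothetical labels)
    verified: bool   # passed an external check, e.g. tests ran or a citation resolved
    quality: float   # hypothetical score in [0, 1] from some upstream filter


def build_training_mix(examples, max_synthetic_fraction=0.2,
                       min_synthetic_quality=0.8, seed=0):
    """Keep every human example; admit only verified, high-scoring synthetic
    examples, capped at a fixed fraction of the final mix."""
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    human = [e for e in examples if e.source == "human"]
    synthetic = [
        e for e in examples
        if e.source == "synthetic" and e.verified
        and e.quality >= min_synthetic_quality
    ]
    # Cap so that synthetic / (human + synthetic) <= max_synthetic_fraction.
    cap = int(len(human) * max_synthetic_fraction / (1 - max_synthetic_fraction))
    rng = random.Random(seed)
    rng.shuffle(synthetic)  # avoid always keeping the same leading items
    return human + synthetic[:cap]
```

The design choice is that synthetic data has to earn its way in (provenance tag, external verification, quality threshold) and is then capped relative to the human core, so adding more generated volume cannot by itself change the mix.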

1 comment

iwontberude 148 days ago

LLMs will always be just a little too random or a little too average. Therein lies the hidden beauty of AI: elevating the trust in people's diverse experiences.

Humans are amazing machines that reduce insane amounts of complexity in bespoke combinations of neural processors to synthesize ideas and emotions. Even Ilya Sutskever has said that he wasn't, and still isn't, clear at a formal level why GPT works at all (e.g. the interpretability problem). But GPT was not a random discovery; it was based on work that was an amalgamation of Ilya's and others' careers and biases.