|
> Without specification, we employ a decoder-only language model GPT2 (Radford et al., 2019) with a configuration of 4 layers, 32 hidden dimensions, and 4 attention
heads. Yeah, ok. The research is interesting, warranted, but writing an article about it, and leading with the conclusions gathered from toy models and implying this generalises to production LLMs is useless. We've been here before with small models. Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no-one red the fine-print, they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results. Research in this area is good, and needed. Mainly to understand limitations, discover if there are any scale levels where "emergent" stuff appears and so on. But writing articles based on incipient research, based on tiny models is not worth the effort. |