|
|
|
|
|
by khalic
374 days ago
|
|
> Villalobos et al. [75] project that frontier LLMs will be trained on all publicly available human-generated text by 2028. We argue that this impending “data wall” will necessitate the adoption of synthetic data augmentation. Once web-scale corpora is exhausted, progress will hinge on a model’s capacity to generate its own high-utility training signal. A natural next step is to meta-train a dedicated SEAL synthetic-data generator model that produces fresh pretraining corpora, allowing future models to scale and achieve greater data efficiency without relying on additional human text. 2028 is pretty much tomorrow… fascinating insight |
|