| HN Mirror

They "simplify" the training data, which they are vastly smaller than. LLMs are like compression algorithms. You could imagine feeding the training data back in, letting it guess the next token, and entropy coding the residual - this would result in an excellent compression ratio. This compression performance is a direct consequence of abstract features of the dataset that it has managed to encode - knowing that the capital of France is Paris allows you to make predictions about many sentences, not just "The capital of France is...".