Hacker News
by dTal 1100 days ago
They "simplify" the training data, which is vastly larger than they are. LLMs are like compression algorithms. You could imagine feeding the training data back in, letting the model guess the next token, and entropy coding the residual - this would result in an excellent compression ratio. That compression performance is a direct consequence of the abstract features of the dataset that the model has managed to encode - knowing that the capital of France is Paris allows you to make predictions about many sentences, not just "The capital of France is...".
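To make the "predict, then entropy code the residual" idea concrete, here is a rough sketch. It is not an LLM: it swaps in a tiny adaptive character-level model as a stand-in predictor, and it only computes the number of bits an ideal entropy coder (e.g. an arithmetic coder) would spend, rather than producing an actual bitstream. The function name `ideal_code_length_bits`, the `order` parameter, and the 256-symbol alphabet are all assumptions made up for this illustration.

```python
import math
from collections import defaultdict


def ideal_code_length_bits(text: str, order: int = 2, alphabet_size: int = 256) -> float:
    """Bits an ideal entropy coder would need for `text` when driven by an
    adaptive order-`order` character model with add-one smoothing.

    Assumes every character fits a 256-symbol alphabet (e.g. ASCII text).
    A decoder could mirror the same count updates, so no model needs to be
    transmitted alongside the coded data.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> next char -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        # Predicted probability of the character that actually occurs next.
        p = (counts[ctx][ch] + 1) / (totals[ctx] + alphabet_size)
        # Cost of coding it: the better the prediction, the fewer the bits.
        bits += -math.log2(p)
        # Update the model only after coding, as a decoder would.
        counts[ctx][ch] += 1
        totals[ctx] += 1
    return bits


sample = "The capital of France is Paris. " * 50
raw_bits = 8 * len(sample)
model_bits = ideal_code_length_bits(sample)
print(f"raw: {raw_bits} bits, model-coded: {model_bits:.0f} bits")
```

The point of the toy is only the mechanism: every bit saved comes from the predictor assigning higher probability to what actually comes next. A model that has captured more abstract regularities would predict better and so compress further.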
1 comment

True, but I still think there's some fallacy here. Are we sure that models of the world (i.e. understanding) are the only way to achieve compression?