| reality is infinite. a corpus of training data from the internet is finite. any finite number divided by infinity ends up tending towards zero. so, mathematically at least, the training data is not a sufficient sample of reality, because the proportion of reality being sampled is basically always zero! fun with maths ;) > What exactly do you think those math pipelines represent? probability distributions of human language, in the case of text-only LLMs, which is a very small subset of the stuff in reality. - also, training data scraped from the public internet is a woeful representation of “reality” if you ask me. that’s why i think LLMs are bullshit machines. the systems are built on other people’s bullshit posted on the public internet. we get bullshit out because we made a bunch of bullshit. it’s just a feedback loop. (some of the training data is not bullshit, but there is a lot of bullshit in there.) |
Since LLMs are built directly on that language, they are definitely based on reality and are a model of it. Are they perfect? No. Are they limited? Yes. Are they "bullshit"? Only to someone who is judging emotionally.