by nl
1021 days ago
Being unformatted is irrelevant (or a benefit). That it contains slang is a huge benefit: how else will it learn slang? That it is often far from the truth is of mixed value. Firstly, it is very unclear that an LLM is the best method for holding facts. Secondly, the "untrue" data at least gives the LLM an idea of what could be true in a given context, which lets it learn a better model of the world. For example, if it gets a lot of text wrongly claiming that "Winston Churchill is the 40th President of the United States", that adds to the evidence that "The President of the United States" and "Winston Churchill" are both in the class of "people". This is opposed to nonsense text like "A carpet States President United", which is just noise.