by leodriesch
1021 days ago
I am wondering how much value social network data like Twitter or Discord really has when compared to something like Wikipedia or textbooks. It is mostly unformatted, contains slang and is often far from the truth. Is this data really all that useful for training an LLM?
That it contains slang is a huge benefit - how else will it learn slang?
That it is often far from the truth is of mixed value. First, it is far from clear that an LLM is the best way to store facts. Second, even the "untrue" data gives the LLM an idea of what could be true in a given context, which helps it learn a better model of the world.
For example, if it gets a lot of text wrongly claiming that "Winston Churchill is the 40th President of the United States" it will add to the evidence that "The President of the United States" and "Winston Churchill" are both in the class of "people".
This is opposed to nonsense text like "A carpet States President United" (which is just noise).
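To make the point concrete, here is a rough sketch in Python. It is only a toy: the corpus is invented, `frames` and `jaccard` are hypothetical helper names, and a real LLM learns this implicitly through next-token prediction rather than by explicit counting. It just shows that a false but well-formed sentence places "Winston Churchill" in the same sentence frames as a real president, while the scrambled sentence shares no frame with anything.

```python
from collections import defaultdict

def frames(sentences, entities):
    """Map each entity to the set of sentence frames it appears in.

    A frame is the sentence with the entity replaced by "_".
    (Hypothetical helper, for illustration only.)
    """
    seen = defaultdict(set)
    for sentence in sentences:
        for entity in entities:
            if entity in sentence:
                seen[entity].add(sentence.replace(entity, "_"))
    return seen

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Invented toy corpus: one true claim, one false claim, two neutral
# sentences, and one line of word-salad noise.
corpus = [
    "Ronald Reagan was the 40th President of the United States",      # true
    "Winston Churchill was the 40th President of the United States",  # false, well-formed
    "Ronald Reagan gave a speech",
    "Winston Churchill gave a speech",
    "A carpet States President United",                               # noise
]
entities = ["Ronald Reagan", "Winston Churchill", "A carpet"]

f = frames(corpus, entities)
sim_false = jaccard(f["Ronald Reagan"], f["Winston Churchill"])
sim_noise = jaccard(f["Ronald Reagan"], f["A carpet"])
print(sim_false, sim_noise)
```

In this toy setup the false sentence still counts as evidence that the two names belong to the same class, while the noise line contributes nothing of the kind.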