Hacker News
by leodriesch 1021 days ago
I am wondering how much value social network data like Twitter or Discord really has when compared to something like Wikipedia or textbooks.

It is mostly unformatted, contains slang and is often far from the truth. Is this data really all that useful for training an LLM?

2 comments

Unformatted is irrelevant (or a benefit).

That it contains slang is a huge benefit - how else will it learn slang?

That it is often far from the truth is of mixed value. Firstly, it is very unclear that an LLM is the best method for holding facts. Secondly, the "untrue" data at least gives the LLM an idea of what could be true given a context, which lets it learn a better model of the world.

For example, if it gets a lot of text wrongly claiming that "Winston Churchill is the 40th President of the United States" it will add to the evidence that "The President of the United States" and "Winston Churchill" are both in the class of "people".

This is opposed to nonsense text like "A carpet States President United" (which is just noise).
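
To make the distinction concrete, here is a toy sketch (not how an LLM is actually trained, just next-word counts). The corpus, the function names and the choice of "the word that follows" as context are all assumptions made for illustration. The false Churchill sentence still puts "Churchill" in the same slot as a real president, while the shuffled sentence puts "carpet" somewhere unrelated.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences):
    """Map each token to a Counter of the tokens that immediately follow it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, following in zip(tokens, tokens[1:]):
            vectors[current][following] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus: two true sentences, the false claim, and the noise line.
corpus = [
    "Abraham Lincoln is the 16th President of the United States",
    "Ronald Reagan is the 40th President of the United States",
    "Winston Churchill is the 40th President of the United States",  # false but well-formed
    "A carpet States President United",  # word salad
]

vectors = context_vectors(corpus)
churchill_vs_reagan = cosine(vectors["churchill"], vectors["reagan"])
carpet_vs_reagan = cosine(vectors["carpet"], vectors["reagan"])

print("churchill ~ reagan:", churchill_vs_reagan)
print("carpet    ~ reagan:", carpet_vs_reagan)
```

In this toy setup the false sentence contributes usable signal about what kind of thing "Churchill" is, and the shuffled one contributes none. A real model uses far richer context than a single following word.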

Why do we want LLMs to understand slang? Will they use slang to cure diseases?
Who is this "we" you speak of? Because the people using them to generate lyrics certainly want slang.

But yes, there are plenty of scenarios where knowledge of slang could cure diseases.

Consider mental health, where one of the main effective interventions is diarying. If an LLM can understand the slang in a diary then it is certainly possible it could intervene successfully.

Sounds like a joke.
Social media has more idiosyncratic and human-esque text.

People already complain that ChatGPT is too formal and verbose, as an AI language model.

That's probably more because of RLHF though: they've optimised for certain kinds of responses rather than simple model loss on internet text.