by confeit 2156 days ago
> anything trained on internet data is kinda doomed to poison itself on the high ratio of garbage floating around here?

Low-quality noise cancels out and leaves the high-quality signal. In the limit, the internet offers the true sequence probabilities for compression of natural text.

You can also put more weight on authoritative data sources, such as Wikipedia and Stack Overflow. But even with uniform weighting, it is possible to sequence-complete prime numbers, despite the many, many pages online filled with random numbers.
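
To make the prime-number point concrete, here is a toy sketch (my own illustration, not how GPT-3 or any real model is trained). It assumes a made-up corpus where random-number "pages" outnumber clean prime listings 10 to 1, and a plain bigram counter stands in for the language model. The names `build_corpus`, `train_bigrams`, and `complete` are hypothetical helpers invented for this example.

```python
import random
from collections import Counter, defaultdict

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

def build_corpus(n_clean=20, n_noise=200, seed=0):
    """Return (tokens, source) pages: a few clean prime lists, many random ones."""
    rng = random.Random(seed)
    clean = [(list(PRIMES), "clean") for _ in range(n_clean)]
    noise = [([rng.randint(1, 50) for _ in PRIMES], "noise") for _ in range(n_noise)]
    return clean + noise

def train_bigrams(corpus, weights=None):
    """Count next-token continuations, optionally weighting pages by source."""
    weights = weights or {}
    counts = defaultdict(Counter)
    for tokens, source in corpus:
        w = weights.get(source, 1.0)  # uniform weighting unless told otherwise
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += w
    return counts

def complete(counts, start, steps):
    """Greedily follow the most frequent continuation."""
    out, current = [], start
    for _ in range(steps):
        current = counts[current].most_common(1)[0][0]
        out.append(current)
    return out

corpus = build_corpus()
uniform = train_bigrams(corpus)
print(complete(uniform, 2, 6))  # → [3, 5, 7, 11, 13, 17]
```

The noise spreads its counts thinly over all 50 possible continuations, while the clean pages all agree, so the consistent signal wins even with uniform weights. Passing something like `weights={"clean": 5.0}` to `train_bigrams` only widens the margin.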

GPT-3 is trained on a filtered version of Common Crawl, supplemented with more authoritative datasets, such as Books1, WebText2, and English-language Wikipedia. Moderation is done automatically, with a toxicity classifier/toggle. If GPT-n becomes good enough for its output to be accepted into authoritative datasets, then that output is perfectly fine training data, a form of semi-supervised learning.

Bias is going to be a double-edged sword: I believe it will be impossible either to prescribe common sense or to sanitize it (removing, say, gender bias) and still have a model that can understand a sexist joke about female programmers or male nurses. We want an AI to be human, but we don't want it to associate CEOs with white males with dark hair, wearing suits. Those goals will conflict.

1 comment

"authoritative data sources, such as Wikipedia"

Lol

> Canberra is the capital city of Australia.