HN Mirror

by RC_ITR 1191 days ago

>We are getting close to the point where there isn’t enough human written text in existence to continue scaling these models.

People say this, but GPT-3 (the latest model we know the details on) drew on roughly 45TB of raw text before filtering. That may be most of the open Internet, but it still lacks non-publicly-indexed Internet text (e.g. things behind paywalls, things behind log-in screens like emails), any book outside of Bibliotik's 200k books (remember when Google was randomly digitizing all the books it could get its hands on?), and plenty of other non-digitized text.

OpenAI wants you to believe that we are running out of text, but at Google alone there are hundreds of TB of text that OpenAI doesn't have access to (Google Books, Google Docs, Gmail, search queries, archived pages beyond what Common Crawl gets, paywalled news articles that allow Google to crawl them, etc.).

Now the key question that GPT-4 will hopefully answer is "are bigger datasets really the key, or are larger context windows?"

If you're thinking of investing in or working for OpenAI, you'd better hope the answer is context windows.