Hacker News
by valine 1195 days ago
I think for now, the data requirements to train a SOTA LLM are so extreme we don’t have the luxury of being picky with the training data. We are getting close to the point where there isn’t enough human written text in existence to continue scaling these models.

Model refinement seemingly has lower training requirements, putting it within the reach of smaller organizations or wealthy individuals. If you don’t like the refinement dataset it will likely be feasible to bootstrap your own off someone else’s LLM. See what Stanford did with Alpaca.

2 comments

I'm waiting for a general correction mechanism; I don't even know what to call it. You'd say "No, ChatGPT, people usually have 5 fingers," and the model just learns, rather like a child. I keep thinking that's the next real step.
The problem is that, to the extent the analogy of ChatGPT to a living thing makes sense, the individual isn’t the model (that's just the shared set of instincts that defines the species, or maybe “clone family” is a better term than “species”); the individual's lifespan is the conversation.

You could share feedback across conversations by allocating prompt space to it, at the expense of limiting the size of the conversation, but you'd need a way to decide what to share this way.
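
A rough sketch of that trade-off, under loud assumptions: tokens are approximated by whitespace-separated words (a real tokenizer would count differently), and `reserve_shared_feedback` and `note_budget` are hypothetical names made up for illustration. The notes are simply taken in the order given, which sidesteps the hard part of deciding what deserves to be shared.

```python
def count_tokens(text):
    # Assumption: one token per whitespace-separated word (crude proxy).
    return len(text.split())


def reserve_shared_feedback(context_limit, shared_notes, note_budget):
    """Keep shared notes, in order, until note_budget is used up.

    Returns the kept notes and the tokens left for the conversation.
    """
    kept = []
    used = 0
    for note in shared_notes:
        cost = count_tokens(note)
        if used + cost > note_budget:
            break  # the next note would exceed the reserved space
        kept.append(note)
        used += cost
    # Every token spent on shared feedback is one the conversation loses.
    return kept, context_limit - used


notes = ["people usually have five fingers", "be concise"]
kept, remaining = reserve_shared_feedback(100, notes, note_budget=6)
print(kept, remaining)
```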

You could also take the conversation and use it as part of the reinforcement learning dataset. I feel like that's the closest thing to long term memory ChatGPT is capable of right now.
I think what's mainly stopping that from happening is that GPT-4 doesn't remember older chats. If we make it remember everything, it should get more personal, right?
The token limit is the problem. In general, token limits can’t be changed after the model has been trained. GPT-4 has an exceptionally large 32k token limit, but even with 32k tokens you’d only get a few weeks of chat before the context window was full.

Not to mention the added cost of using the full 32k tokens. OpenAI is charging up to $0.12 per 1K tokens, which would quickly add up. It’s prohibitively expensive unless you have a very compelling business use case.
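
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the price is quoted per 1,000 tokens, and it uses $0.06 per 1K prompt tokens and $0.12 per 1K completion tokens as illustrative rates for the 32k model. `request_cost` and its parameters are hypothetical names, not part of any OpenAI API.

```python
def request_cost(prompt_tokens, completion_tokens,
                 prompt_price_per_1k=0.06, completion_price_per_1k=0.12):
    """Estimate the dollar cost of one request.

    The default rates are assumptions for illustration; check current
    pricing before relying on them.
    """
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost


# One request that fills a 32k window: 31,000 tokens in, 1,000 out.
cost = request_cost(31_000, 1_000)
print(f"${cost:.2f}")
```

Under those assumptions a single full-window request comes to roughly $2, and that is paid again on every message if the whole history is resent each turn.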

Maybe trim chat history to most important content?
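
A minimal sketch of what trimming could look like. Note the swap: it uses recency as a stand-in for "most important", since scoring importance is the unsolved part. It also assumes one token per whitespace-separated word, which a real tokenizer would not match exactly. `trim_history` and `estimate_tokens` are hypothetical helpers.

```python
def estimate_tokens(text):
    # Assumption: one token per whitespace-separated word (crude proxy).
    return len(text.split())


def trim_history(messages, budget):
    """Keep the newest messages whose estimated tokens fit in budget.

    Messages are returned in their original (oldest-first) order.
    """
    kept = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()
    return kept


history = ["tell me about llamas", "llamas are camelids", "thanks"]
print(trim_history(history, budget=4))
```

Swapping the recency rule for a relevance score (for example, similarity to the latest message) would be closer to the suggestion, at the cost of needing some way to compute that score.
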
>We are getting close to the point where there isn’t enough human written text in existence to continue scaling these models.

People say this, but GPT-3 (the latest we know the details on) was trained on 45TB of text, which may be most of the open Internet, but still lacks non-publicly-indexed Internet text (e.g. things behind paywalls, things behind log-in screens like emails), any book outside of Bibliotik's 200k books (remember when Google was randomly digitizing all books it could get its hands on?), and plenty of other non-digitized text.

OpenAI wants you to believe that we are running out of text, but even at Google alone there are hundreds of TB of text that OpenAI doesn't have access to (Google Books, Google Docs, Gmail, search queries, archived pages beyond what CommonCrawl gets, paywalled news articles that allow Google to crawl them, etc.).

Now the key question that GPT-4 will hopefully answer is "are bigger datasets really the key, or are larger context windows?"

If you're thinking of investing in/working for OpenAI, you better hope the answer is context windows.