|
You can also think of the ChatGPT app as an RL environment. The environment consists of the agent (the AI), a second agent (the human), and some tools (web search, code, APIs, vision). This grounds the AI in human and tool responses, which generate feedback that can be incorporated into the model with RL methods. Basically, every reply from a human can be interpreted as a reward signal. If the human restates the question, that is a negative reward: the AI didn't get it. If the human corrects the AI, that is another negative reward, but if they simply continue the thread, it is positive. You can use GPT-4 to judge and annotate all the chat logs, both turn by turn and end to end. The great thing about chat-based feedback is that it scales. OpenAI has 100M users, and they generate these chat sessions by the millions every day. Then OpenAI just needs to do a second pass (expensive, yes) to annotate the chat logs with RL reward signals and retrain. But they get the human in the loop for free, and that is the best source of feedback. AI-human chat data is in-domain for both the AI and the human, something we can't say about other training data. It contains the kinds of mistakes the AI makes and the kinds of problems humans want to solve with AI. My bet is that OpenAI has realized this and created GPTs in order to enrich and empower the AI to create the best training data for GPT-5. The secret sauce of OpenAI is not their people, or Sam, or the computers, but the training set, especially the augmented and synthetic parts. |
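To make the "every reply is a reward signal" idea concrete, here is a rough Python sketch of a turn-by-turn annotator. It is only an illustration under my own assumptions, not how OpenAI does it: the `Turn` type, the `CORRECTION_PREFIXES` list, the 0.8 similarity threshold, and the +1/-1 reward values are all hypothetical, and a crude string heuristic stands in for the GPT-4 judge described above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Turn:
    role: str  # "human" or "ai"
    text: str


# Hypothetical markers for "the human is correcting the AI".
CORRECTION_PREFIXES = ("no,", "no.", "wrong", "that's wrong", "incorrect")


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reward_for_reply(prev_human: str, next_human: str,
                     restate_threshold: float = 0.8) -> int:
    """Score one AI turn from the human message that follows it."""
    text = next_human.lower().strip()
    if text.startswith(CORRECTION_PREFIXES):
        return -1  # the human corrected the AI
    if similarity(prev_human, next_human) >= restate_threshold:
        return -1  # the human restated the question
    return 1       # the human simply continued the thread


def annotate(chat: list[Turn]) -> list[tuple[int, int]]:
    """Return (index, reward) for each AI turn that got a human reply."""
    rewards = []
    last_human = None
    for i, turn in enumerate(chat):
        if turn.role == "human":
            last_human = turn.text
        elif last_human is not None and i + 1 < len(chat) \
                and chat[i + 1].role == "human":
            rewards.append((i, reward_for_reply(last_human, chat[i + 1].text)))
    return rewards


example_chat = [
    Turn("human", "How do I reverse a list in Python?"),
    Turn("ai", "Use sorted()."),
    Turn("human", "How do I reverse a list in Python?"),
    Turn("ai", "Use my_list.reverse()."),
    Turn("human", "No, I need a new list, not in-place."),
    Turn("ai", "Then use my_list[::-1]."),
    Turn("human", "Great, and how about a string?"),
]

print(annotate(example_chat))  # [(1, -1), (3, -1), (5, 1)]
```

The first AI turn is penalized because the human restates the question, the second because the human corrects it, and the third is rewarded because the thread moves on. A real pipeline would presumably swap `reward_for_reply` for a model-based judge and add an end-to-end score per session.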