How ChatGPT works

Last Update Time: 2023-06-05 14:34:16

1. ChatGPT training process

The ChatGPT training process is clearly laid out and divided into three main steps:


The first step is to fine-tune GPT-3.5 with supervised learning to obtain an initial model. The training set contains roughly 20,000-30,000 examples (an estimated magnitude, extrapolated from the training data size of its sibling model, InstructGPT). Annotators play both the user and the chatbot, producing multi-turn dialogue data with manual refinement. Notably, when a human plays the chatbot, they are shown machine-generated suggestions to help draft their replies, which speeds up annotation.

Although this training set is not large, it is high in quality and diversity and comes from real-world usage, which is a key point.
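
To make the first step concrete, here is a minimal sketch of supervised fine-tuning as a standard causal-LM training step. The public `gpt2` checkpoint and the single toy dialogue are stand-ins, since GPT-3.5 and OpenAI's annotator-written demonstrations are not available, and the hyperparameters are illustrative rather than OpenAI's actual settings.

```python
# Minimal supervised fine-tuning (SFT) sketch with Hugging Face Transformers.
# "gpt2" and the toy dialogue stand in for GPT-3.5 and the annotator-written
# multi-turn demonstration data, which are not public.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demo = ("User: How do I boil an egg?\n"
        "Assistant: Place the egg in boiling water for about 8 minutes, then cool it.")
batch = tokenizer(demo, return_tensors="pt")

model.train()
out = model(**batch, labels=batch["input_ids"])  # standard next-token cross-entropy
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```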


The second step is to collect comparison data: responses generated under the same context, ranked by quality. A large number of prompts are sampled at random, and the fine-tuned model from the first stage generates several different answers for each prompt. Annotators then rank the k results, and the rankings are expanded into pairs of training examples. A reward model is trained on these pairs with a pairwise ranking loss, so that it can predict which output a labeler would prefer. Learning from comparisons yields relatively precise reward values.

This step shifts ChatGPT from being command-driven to being intent-driven. The training data here does not need to be large either; tens of thousands of examples suffice, because it does not have to cover every possible question. It only has to teach the model human preferences and strengthen its ability to act on the user's intent.
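
A minimal sketch of the pairwise ranking loss from this second step, assuming a hypothetical `reward_model(prompt, response)` that returns a scalar tensor; the k annotator-ranked answers are expanded into all chosen/rejected pairs, as described above.

```python
import itertools
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, prompt, ranked_responses):
    """ranked_responses is ordered best-to-worst by the annotator.
    reward_model(prompt, response) -> scalar reward tensor (hypothetical interface)."""
    rewards = [reward_model(prompt, r) for r in ranked_responses]
    losses = []
    # Each better-ranked response should receive a higher reward than each worse one:
    # loss = -log(sigmoid(r_better - r_worse)), averaged over all k-choose-2 pairs.
    for i, j in itertools.combinations(range(len(rewards)), 2):
        losses.append(-F.logsigmoid(rewards[i] - rewards[j]))
    return torch.stack(losses).mean()
```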


In the third step, the model from the first stage is fine-tuned with the PPO reinforcement learning algorithm. The core idea is to sample new prompts at random and have the second-stage reward model score the generated answers. That score is the overall reward for the answer; it is propagated back, and the resulting policy gradient updates the PPO model's parameters. The whole process is iterated several times until the model converges.

The reinforcement learning algorithm can be understood simply as adjusting the model parameters so that the model earns the maximum reward; a maximal reward means the reply best matches the preferences expressed by the human annotators. As for PPO, it is a policy optimization algorithm for reinforcement learning proposed by OpenAI in 2017. It introduces a new objective function that allows small-batch updates over multiple training steps. It is simple to implement, easy to understand, stable in performance, able to handle both discrete and continuous action spaces, and well suited to large-scale training.
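
The third step can be summarized as the loop below. This is a high-level sketch only: `sample_prompts`, `policy.generate`, `reward_model`, and `ppo_update` are hypothetical placeholders (libraries such as `trl` provide full implementations), and details such as the per-token KL penalty against the SFT model that is typically added are assumed to live inside `ppo_update`.

```python
def rlhf_ppo_loop(policy, reward_model, num_iterations, batch_size):
    """High-level sketch of stage 3 (RLHF with PPO); all helpers are placeholders."""
    for _ in range(num_iterations):
        prompts = sample_prompts(batch_size)                 # randomly drawn new prompts
        responses = [policy.generate(p) for p in prompts]    # roll out the current policy
        rewards = [reward_model(p, r) for p, r in zip(prompts, responses)]
        # The scalar reward for each full response drives the policy-gradient step;
        # ppo_update is assumed to handle advantage estimation and the clipped objective.
        ppo_update(policy, prompts, responses, rewards)
    return policy
```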

The three steps above make up the training process of ChatGPT, collectively referred to in the literature as RLHF.


2. Why is ChatGPT successful?

Why does this three-stage training recipe make ChatGPT so powerful? The training process above embodies the key points mentioned earlier, and these are the reasons for ChatGPT's success:

1. Strong base-model capability (InstructGPT)

2. A language model with a very large number of parameters (GPT-3.5)

3. High-quality real data (precisely labeled multi-turn dialogue data and comparison/ranking data)

4. A reinforcement learning algorithm with stable performance (PPO)


What we need to keep in mind is that the success of ChatGPT was built on a great deal of prior work; it did not appear out of thin air. We elaborate below.


InstructGPT

ChatGPT is a sibling model of InstructGPT, which is trained to follow instructions in a prompt and provide detailed responses. InstructGPT is the work OpenAI presented in the paper Training language models to follow instructions with human feedback in March 2022. The overall pipeline is essentially the same as the ChatGPT process above, with slight differences in data collection, the base model (GPT-3 vs. GPT-3.5), and the initialization of the PPO model in the third step.

InstructGPT's setup is similar to ChatGPT's: given an instruction, a human is asked to write an answer. First, the team trained an early version of InstructGPT on fully human-labeled data of three kinds: instruction + answer, instruction + multiple examples, and user requests collected through use of the API. From the annotation of the second type of data, one can speculate that ChatGPT may use retrieval to supply multiple in-context-learning examples to assist manual annotation. The remaining steps are the same as for ChatGPT above.

What deserves attention but is often overlooked is OpenAI's control over data quality and data generalization, which is one of its major advantages. First, it screens for high-quality labelers: annotators are selected based on a screening test of their ability to identify and respond to sensitive prompts. Second, a broader group of annotators verifies the training data at a later step, to ensure it matches the preferences of the wider population.

With the above work in place, we can look at the differences between InstructGPT and GPT-3:


GPT-3's answer is short, overly general, and has nothing noteworthy, while InstructGPT holds forth at length to explain why liberalism is "stupid". The model has clearly learned the kind of lengthy, elaborated answers that people prefer for such questions.

GPT-3 is just a language model: it predicts the next word without considering the answer the user actually wants. When the three kinds of human annotations representing user preferences are used as fine-tuning data, the 1.3B-parameter InstructGPT outperforms the 175B-parameter GPT-3 in many scenarios:


InstructGPT's work is also groundbreaking: it "unlocks" and mines the knowledge and capabilities that GPT-3 learned from massive data but which are hard to elicit through in-context prompting alone. InstructGPT found a way to exploit GPT-3's powerful linguistic capabilities for subjective tasks.

As the OpenAI blog post puts it: when the safety and alignment problems we want to solve are complex and subjective, and their quality cannot be fully measured by automatic metrics, human preferences must be used as a reward signal to fine-tune our models.

Prelude to InstructGPT: Combining GPT and Reinforcement Learning

In fact, soon after GPT-2 appeared in 2019, OpenAI began trying to combine it with reinforcement learning. Learning to Summarize from Human Feedback (NeurIPS 2020) describes training a summarization model with reinforcement learning from human feedback. The overall flowchart of that work already shows the three-step core idea: collect feedback data -> train a reward model -> run PPO reinforcement learning.


The first stage of this RLHF pipeline has humans rank multiple candidate summaries (this is where OpenAI's deep pockets show: annotators were paid according to the time spent labeling, and those who labeled too quickly were dismissed); the second stage trains the ranking (reward) model, still built on a GPT model; the third stage uses the PPO algorithm to learn the policy (a GPT fine-tuned on the summarization task).

The model in this paper produces better summaries than models ten times its size. But the paper also points out that part of this success comes from scaling up the reward model, which requires enormous compute: training a 6.7B reinforcement learning model cost about 320 GPU-days.

In OpenAI's earlier Fine-Tuning GPT-2 from Human Preferences (2019), the same pattern is already visible: a pre-trained model is first used to train a reward model, and PPO is then used for reinforcement learning. The overall pipeline is the first recognizable prototype of ChatGPT!


The idea of RLHF (reinforcement learning from human feedback) was proposed in OpenAI's Deep Reinforcement Learning from Human Preferences in June 2017. The core idea: show humans pairs of short video clips of agent behavior and have them judge which one is closer to the intended goal; train a reward function that best explains those human judgments; and then use RL to learn how to achieve the goal under that reward.
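
A small sketch of the preference model used in that 2017 work, assuming a learned per-step reward predictor: the summed predicted rewards of two behavior clips are turned into a Bradley-Terry style preference probability and trained with cross-entropy against the human judgment.

```python
import torch

def preference_loss(rewards_clip1, rewards_clip2, human_pref):
    """rewards_clip1/2: predicted per-step rewards for two clips, shape (T,).
    human_pref: 1.0 if the human judged clip 1 closer to the goal, 0.0 for clip 2."""
    logits = torch.stack([rewards_clip1.sum(), rewards_clip2.sum()])
    probs = torch.softmax(logits, dim=0)      # P[clip1 preferred], P[clip2 preferred]
    return -(human_pref * torch.log(probs[0]) + (1.0 - human_pref) * torch.log(probs[1]))
```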


It is fair to say that ChatGPT is an excellent piece of work standing on the shoulders of InstructGPT and the lines of research above. Together they combine LLMs (large language models) / PTMs (pre-trained language models) with RL (reinforcement learning) and prove that this direction is feasible. It is also a direction in which NLP, and general-purpose agents more broadly, will continue to develop.


PPO

PPO (Proximal Policy Optimization) is a newer kind of policy gradient algorithm (policy gradient methods are reinforcement learning algorithms that achieve goals in an environment by directly optimizing the agent's behavior policy). The key point to understand is that ordinary policy gradient algorithms are very sensitive to the step size, yet an appropriate step size is hard to choose: if the new and old policies differ too much during training, learning suffers.

PPO addresses this by proposing a new objective function that allows small-batch updates over multiple training steps, solving the problem of choosing the step size in policy gradient methods. Because it is simple to implement, stable in performance, able to handle both discrete and continuous action spaces, and suitable for large-scale training, it has received wide attention in recent years and has become OpenAI's default reinforcement learning algorithm.
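
The "new objective function" referred to above is PPO's clipped surrogate objective; here is a minimal sketch of its loss term, assuming log-probabilities and advantage estimates have already been computed elsewhere.

```python
import torch

def ppo_clipped_loss(logprob_new, logprob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective from the 2017 PPO paper (sketch).
    logprob_new/logprob_old: log-probs of the taken actions under the new/old policy.
    advantage: advantage estimates (e.g. from GAE), same shape as the log-probs."""
    ratio = torch.exp(logprob_new - logprob_old)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the element-wise minimum; return the negative for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```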


WebGPT and CICERO

Over the past two years, the major labs have done a great deal of solid work combining LLMs with RL and, more broadly, reinforcement learning with NLP training, and these results are as remarkable as ChatGPT. Here we take OpenAI's WebGPT and Meta's Cicero as examples.

WebGPT is an OpenAI effort from the end of 2021. Its core idea is to use the strong generation ability of the GPT-3 model to learn the sequence of actions a human takes when using a search engine, and to predict human preferences by training a reward model, so that WebGPT can search web pages on its own to answer open-domain questions, with generated answers that satisfy human preferences as much as possible.

Cicero is an AI system released by Meta AI in November 2022 that plays a text-based strategy game at a human level. It interacts with humans, using strategic reasoning and natural language to negotiate and compete with them in gameplay. At its core, Cicero is driven by a controllable dialogue engine, similar in spirit to GPT-3, together with a strategic reasoning engine that makes heavy use of RL.