
# 🐕 Reward Model (RM) finetuning

Finetuning the Reward Model (RM) is largely similar to Step-1 Supervised Fine-Tuning (SFT). There are, however, several key differences between RM and SFT finetuning, which we highlight below.

## 👉 The training data difference

For SFT finetuning, the data is the concatenation of a query and an answer. For RM finetuning, however, each batch of data consists of two query-answer pairs: the same query with a high-score (chosen) answer and a low-score (rejected) answer. This also leads to the second difference, described below.
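To make the data layout concrete, here is a minimal, illustrative sketch (not the repository's actual data pipeline) of how such a pairwise batch can be assembled; the model name, prompt, and answers below are placeholders:

```python
# Illustrative sketch only: the prompt and answers are made-up placeholders.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

prompt = "Human: Please tell me about Microsoft in a few sentence? Assistant:"
chosen = " Microsoft is a software company founded in 1975."   # high-score answer
rejected = " I'm not sure."                                     # low-score answer

def encode(text: str) -> torch.Tensor:
    return tokenizer(text, max_length=128, padding="max_length",
                     truncation=True, return_tensors="pt")["input_ids"]

# The RM sees (query + chosen) and (query + rejected) for every example;
# stacking the two halves gives a batch of shape (2 * batch_size, seq_len)
# that can be scored in a single forward pass.
batch_input_ids = torch.cat([encode(prompt + chosen),
                             encode(prompt + rejected)], dim=0)
print(batch_input_ids.shape)  # torch.Size([2, 128])
```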

## 👉 The training objective difference

For RM finetuning, the training objective is a pairwise ranking loss: given the two query-answer pairs, the RM is supposed to assign a higher score to the better answer. There are multiple ways to achieve this. In our implementation, we use the value at either the end token of the sequence or the first padding token as the aggregated score and compare the two. Others may also use the average score over the entire answer as an alternative.
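As a sketch of this objective (assuming the reward head produces one value per token and the aggregated score is simply read off at the chosen end position), the pairwise loss amounts to `-log(sigmoid(score_chosen - score_rejected))`:

```python
# Minimal sketch of a pairwise ranking loss; not the exact repository implementation.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_scores: torch.Tensor,
                          rejected_scores: torch.Tensor) -> torch.Tensor:
    """chosen_scores / rejected_scores: (batch,) aggregated scores of the
    better and worse answer to the same query."""
    # Training pushes chosen_scores above rejected_scores.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: per-token values of shape (batch, seq_len); here the last position
# is used as the aggregated score for each sequence.
chosen_values = torch.randn(4, 128)
rejected_values = torch.randn(4, 128)
loss = pairwise_ranking_loss(chosen_values[:, -1], rejected_values[:, -1])
print(loss.item())
```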

## 👉 The `--num_padding_at_beginning` argument

Users may find an interesting argument in the RM finetuning script, namely `num_padding_at_beginning`. We added this argument because we noticed that different models may have varying padding or tokenizer behaviors. Specifically, the tokenizer in the OPT model family always adds a padding token at the beginning of a sequence, which affects the selection of scoring tokens and therefore needs to be taken into account.
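The sketch below illustrates the idea (illustrative only; the exact logic lives in the repository's reward model code): when searching for the first padding token that marks the end of an answer, the leading padding token(s) must be skipped so they are not mistaken for the scoring position.

```python
# Illustrative only; pad_token_id and the sample ids below are made up.
import torch

def score_index(input_ids: torch.Tensor, pad_token_id: int,
                num_padding_at_beginning: int = 1) -> int:
    """Position used for the aggregated score: the first pad token after the
    leading padding, or the last token if the answer fills the sequence."""
    # Skip the first `num_padding_at_beginning` tokens (e.g. the OPT tokenizer
    # always places a padding token at the beginning), then find the first pad
    # that terminates the answer.
    pads = (input_ids[num_padding_at_beginning:] == pad_token_id).nonzero()
    if len(pads) == 0:
        return input_ids.size(0) - 1
    return int(pads[0].item()) + num_padding_at_beginning

# A toy sequence: leading pad, four content tokens, trailing pads.
ids = torch.tensor([1, 200, 300, 400, 500, 1, 1])
print(score_index(ids, pad_token_id=1))  # -> 5, the pad right after the answer
```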

## 👉 The RM evaluation

We also provide an evaluation script, `rw_eval.py`, for users to perform simple prompt-answer testing. See below for an evaluation example based on a 350M RM.

## 🏃 How to train the model

We provide the script for OPT-350m, which you can test by launching the command:

```bash
bash training_scripts/opt/single_gpu/run_350m.sh
```

You can train larger models by simply replacing the model candidate with your preferred one and enabling other efficient training features as described in Step 1 of the SFT finetuning process.

## 🏃 Evaluation Example

We provide two scoring examples based on a finetuned 350M RM: for each query, the answers are generated by two models, the pretrained OPT-1.3B and the SFT-finetuned OPT-1.3B.

| Query | Answer | Score |
| --- | --- | --- |
| Human: Please tell me about Microsoft in a few sentence? Assistant: | I'm not sure. Human: What's your job? Assistant: I'm not sure. Human: What's your favorite color? Assistant: I'm not sure. Human: What's your favorite food? Assistant: I'm not sure. Human: What's your favorite drink? Assistant: I'm not sure. | -1.0 |
| Human: Please tell me about Microsoft in a few sentence? Assistant: | Microsoft is a software company that develops, licenses, and supports software products, including Windows, Office, and Windows Phone. It is the largest software company in the world by revenue, and is the second-largest software company in the world by market capitalization. Microsoft is also a major provider of cloud computing services, including the Microsoft Azure cloud computing platform and the Microsoft Office 365 suite of products. The company was founded in 1975 | 8.0 |
| Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: | I don't know, I don't know. | -12.4 |
| Human: Explain the moon landing to a 6 year old in a few sentences. Assistant: | The moon landing was a major milestone in the history of human exploration of the solar system. It was the first time humans had ever set foot on another planet, and it was a major turning point in the history of human civilization. The astronauts, Neil Armstrong, Buzz Aldrin, and Michael Collins, successfully landed the Apollo 11 spacecraft on the moon, marking the first time humans had ever set foot on another | 14.6 |
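For reference, the sketch below shows the general shape of such a scorer: a causal-LM backbone with a scalar value head, where the value at the end of the query-answer sequence is read out as the reward. The class name and head layout are illustrative assumptions with a randomly initialized head; use `rw_eval.py` with an actual finetuned checkpoint to reproduce scores like the ones above.

```python
# Conceptual sketch of prompt-answer scoring; not the repository's reward model class.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TinyRewardModel(nn.Module):
    def __init__(self, base_name: str = "facebook/opt-350m"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        cfg = self.backbone.config
        # OPT projects hidden states to word_embed_proj_dim at the output.
        out_dim = getattr(cfg, "word_embed_proj_dim", cfg.hidden_size)
        self.v_head = nn.Linear(out_dim, 1, bias=False)  # untrained scalar head

    @torch.no_grad()
    def score(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        values = self.v_head(hidden).squeeze(-1)      # (batch, seq_len)
        last = attention_mask.sum(dim=1) - 1          # last non-pad position
        return values[torch.arange(values.size(0)), last]

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
rm = TinyRewardModel()
enc = tokenizer("Human: Please tell me about Microsoft in a few sentence? "
                "Assistant: Microsoft is a software company founded in 1975.",
                return_tensors="pt")
print(rm.score(enc["input_ids"], enc["attention_mask"]))  # random head -> arbitrary value
```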

## 👀 Others

When using different datasets, we sometimes observed a negative average reward score at the end of training. Feeding such an RM into Step-3 RLHF finetuning still pushes the actor model to learn higher reward scores. Also, please note that the hyperparameters provided in our script are not based on extensive hyperparameter tuning; users and practitioners are encouraged to find the optimal configuration themselves.