InstructGPT (Training language models to follow instructions with human feedback)

  • In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback
  • Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning
  • We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback (the ranking loss is sketched after this list)
  • Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent
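A minimal sketch of the ranking step in the pipeline above: the human rankings are expanded into pairwise comparisons, and a scalar-output reward model is trained so that the labeler-preferred completion scores higher than the dispreferred one. The loss form, -log σ(r(x, y_w) - r(x, y_l)), follows the paper; the function and variable names here are illustrative, not from any released code.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Pairwise cross-entropy loss on comparison data:
    loss = -log sigmoid(r(x, y_w) - r(x, y_l)).

    `reward_model` is assumed to map (prompt, completion) token ids to one
    scalar score per example, shape (batch,).
    """
    r_chosen = reward_model(prompt_ids, chosen_ids)      # r(x, y_w)
    r_rejected = reward_model(prompt_ids, rejected_ids)  # r(x, y_l)
    # Preferred completion should receive the higher score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```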

  • The language modeling objective (predicting the next token on a webpage from the internet) is different from the objective "follow the user's instructions helpfully and safely"; in this sense the language modeling objective is misaligned
  • During RLHF fine-tuning, we observe performance regressions compared to GPT-3 on certain public NLP datasets, notably SQuAD (Rajpurkar et al., 2018), DROP (Dua et al., 2019), HellaSwag (Zellers et al., 2019), and WMT 2015 French to English translation (Bojar et al., 2015); the paper calls this the "alignment tax" (the shaped reward used during RL fine-tuning is sketched below)
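During the RL stage, the policy is trained to maximize the reward model's score while a per-token KL penalty keeps it close to the supervised fine-tuned (SFT) model; the paper also mixes in pretraining gradients (PPO-ptx) to reduce the regressions noted above. A rough sketch of the shaped reward, assuming per-token log-probabilities are already computed; the names and the β value are illustrative.

```python
import torch

def shaped_rl_reward(reward_score, logprobs_policy, logprobs_sft, beta=0.02):
    """Reward signal for RL fine-tuning:
    R(x, y) = r_theta(x, y) - beta * sum_t [log pi(y_t|x, y_<t) - log pi_SFT(y_t|x, y_<t)].

    `reward_score` has shape (batch,); the log-prob tensors have shape
    (batch, completion_len). beta=0.02 is an illustrative KL coefficient.
    """
    kl_per_token = logprobs_policy - logprobs_sft
    return reward_score - beta * kl_per_token.sum(dim=-1)
```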

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

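The paper's central observation is that the KL-constrained RLHF objective above has a closed-form optimal policy, so the implicit reward r(x, y) = β log(π_θ(y|x) / π_ref(y|x)) can be fit directly on preference pairs with the same pairwise cross-entropy loss, skipping the separate reward model and RL loop. A minimal sketch of the DPO loss under that reading; the variable names and β value here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Inputs are summed log-probabilities of each full completion under the
    trained policy and the frozen reference model, each of shape (batch,).
    loss = -log sigmoid(beta * [(log pi/pi_ref)(y_w) - (log pi/pi_ref)(y_l)]).
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```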