From 9afc550f1d43969a32ad4a6316e87da74eabbff5 Mon Sep 17 00:00:00 2001
From: shengguangming
Date: Wed, 11 Dec 2024 22:53:38 +0800
Subject: [PATCH] [doc] add a new quickstart section

---
 docs/index.rst                          |  12 +-
 docs/{preparation => start}/install.rst |   0
 docs/start/quickstart.rst               | 172 ++++++++++++++++++++++++
 3 files changed, 182 insertions(+), 2 deletions(-)
 rename docs/{preparation => start}/install.rst (100%)
 create mode 100644 docs/start/quickstart.rst

diff --git a/docs/index.rst b/docs/index.rst
index 0cadb743..c91a8f91 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,11 +26,19 @@ veRL is fast with:
 
 .. toctree::
    :maxdepth: 5
-   :caption: Preparation
+   :caption: Quickstart
+   :titlesonly:
+   :numbered:
+
+   start/install
+   start/quickstart
+
+.. toctree::
+   :maxdepth: 5
+   :caption: Data Preparation
    :titlesonly:
    :numbered:
 
-   preparation/install
    preparation/prepare_data
    preparation/reward_function
 

diff --git a/docs/preparation/install.rst b/docs/start/install.rst
similarity index 100%
rename from docs/preparation/install.rst
rename to docs/start/install.rst

diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst
new file mode 100644
index 00000000..2ac68454
--- /dev/null
+++ b/docs/start/quickstart.rst
@@ -0,0 +1,172 @@
.. _quickstart:

============================================================
Quickstart: Fine-tune an LLM using PPO on the GSM8K dataset
============================================================

Post-train an LLM using the GSM8K dataset
==========================================

Introduction
------------

In this example, we train an LLM to tackle the GSM8K task.

Paper: https://arxiv.org/pdf/2110.14168

Dataset: https://huggingface.co/datasets/gsm8k

Note that the original paper mainly focuses on training a verifier (a
reward model) to solve math problems via Best-of-N sampling. In this
example, we train an RLHF agent using a rule-based reward model.

Dataset Introduction
--------------------

GSM8K is a dataset of grade-school math word problems. Each prompt is a
math problem, and the LLM is asked to answer it.

The training set contains 7473 samples and the test set contains 1319
samples.

**An example**

Prompt

   Katy makes coffee using teaspoons of sugar and cups of water in the
   ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups
   of water, calculate the number of teaspoonfuls of sugar she used.

Solution

   The total ratio representing the ingredients she used to make the
   coffee is 7+13 = <<7+13=20>>20. Since the fraction representing the
   number of teaspoons she used is 7/20, she used 7/20*120 =
   <<7/20*120=42>>42
   #### 42

Step 1: Prepare the dataset
---------------------------

.. code:: bash

   cd examples/data_preprocess
   python3 gsm8k.py --local_dir ~/data/gsm8k

Step 2: Download the model
--------------------------

There are three ways to prepare the model checkpoints for post-training:

- Download the required model from Hugging Face:

.. code:: bash

   huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct --local-dir-use-symlinks False

- Use a model already stored in a local directory or an HDFS path.
- Alternatively, use a Hugging Face model name (e.g.,
  ``deepseek-ai/deepseek-math-7b-instruct``) directly in the
  ``actor_rollout_ref.model.path`` and ``critic.model.path`` fields of
  the run script.

Note that users should prepare checkpoints for the actor, the critic,
and the reward model.

[Optional] Step 3: SFT your model
---------------------------------

We provide an SFT trainer built on PyTorch FSDP in
`fsdp_sft_trainer.py `_. Users can customize their own SFT script using
our FSDP SFT trainer.

We also provide several SFT training scripts for the GSM8K dataset in
the `gsm8k sft directory `_.

.. code:: shell

   set -x

   torchrun -m verl.trainer.fsdp_sft_trainer \
      data.train_files=$HOME/data/gsm8k/train.parquet \
      data.val_files=$HOME/data/gsm8k/test.parquet \
      data.prompt_key=question \
      data.response_key=answer \
      data.micro_batch_size=8 \
      model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \
      trainer.default_hdfs_dir=hdfs://user/verl/experiments/gsm8k/deepseek-coder-6.7b-instruct/ \
      trainer.project_name=gsm8k-sft \
      trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \
      trainer.total_epochs=4 \
      trainer.logger=['console','tracking']

Step 4: Perform PPO training with your model on the GSM8K dataset
------------------------------------------------------------------

- Prepare your own run.sh script. Below is an example for the GSM8K
  dataset and the deepseek-llm-7b-chat model.
- Replace ``data.train_files``, ``data.val_files``,
  ``actor_rollout_ref.model.path`` and ``critic.model.path`` according
  to your environment.
- See :doc:`config` for a detailed explanation of each config field.

**Reward Model/Function**

We use a rule-based reward model. We force the model to produce a final
answer following four "#" characters (``####``), as shown in the
solution above. We extract the final answer from both the solution and
the model's output using regular expression matching, compare them, and
assign a reward of 1 for a correct answer, 0.1 for an incorrect answer,
and 0 when no answer is produced.
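To make the scoring rule concrete, the sketch below shows how such a
rule-based reward can be computed. The function name and the exact
regular expression here are illustrative assumptions, not veRL's actual
implementation.

.. code:: python

   import re

   # Matches the final answer that follows '####', e.g. "#### 42".
   ANSWER_PATTERN = re.compile(r"####\s*(-?[\d,.]+)")

   def rule_based_reward(solution: str, model_output: str) -> float:
       """Sketch of the rule-based reward described above."""
       reference = ANSWER_PATTERN.search(solution)
       prediction = ANSWER_PATTERN.search(model_output)
       if prediction is None:
           return 0.0  # no final answer was produced
       if reference is not None and (
           prediction.group(1).replace(",", "")
           == reference.group(1).replace(",", "")
       ):
           return 1.0  # correct final answer
       return 0.1  # a final answer was produced, but it is incorrect

For the example above, a model output that ends with ``#### 42`` would
receive a reward of 1.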
**Training Script**

Training script examples for the FSDP and Megatron-LM backends are
stored in the `examples/ppo_trainer `_ directory.

.. code:: bash

   cd ../ppo_trainer
   bash run_deepseek7b_llm.sh

The content of ``run_deepseek7b_llm.sh``:

.. code:: bash

   set -x

   python3 -m verl.trainer.main_ppo \
      data.train_files=~/data/rlhf/gsm8k/train.parquet \
      data.val_files=~/data/rlhf/gsm8k/test.parquet \
      data.train_batch_size=1024 \
      data.val_batch_size=1312 \
      data.max_prompt_length=512 \
      data.max_response_length=512 \
      actor_rollout_ref.model.path=~/models/deepseek-llm-7b-chat \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.ppo_mini_batch_size=256 \
      actor_rollout_ref.actor.ppo_micro_batch_size=64 \
      actor_rollout_ref.actor.fsdp_config.param_offload=False \
      actor_rollout_ref.actor.fsdp_config.grad_offload=False \
      actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
      actor_rollout_ref.rollout.micro_batch_size=256 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size=128 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
      actor_rollout_ref.ref.log_prob_micro_batch_size=128 \
      actor_rollout_ref.ref.fsdp_config.param_offload=True \
      critic.optim.lr=1e-5 \
      critic.model.path=~/models/deepseek-llm-7b-chat \
      critic.model.enable_gradient_checkpointing=False \
      critic.ppo_micro_batch_size=64 \
      critic.model.fsdp_config.param_offload=False \
      critic.model.fsdp_config.grad_offload=False \
      critic.model.fsdp_config.optimizer_offload=False \
      algorithm.kl_ctrl.kl_coef=0.001 \
      trainer.critic_warmup=0 \
      trainer.logger=['console','tracking'] \
      trainer.project_name='verl_example_gsm8k' \
      trainer.experiment_name='deepseek_llm_7b_function_rm' \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=1 \
      trainer.save_freq=-1 \
      trainer.total_epochs=15
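Before launching the run script, it can be useful to verify that the
checkpoint referenced by ``actor_rollout_ref.model.path`` and
``critic.model.path`` loads correctly. The snippet below is a minimal,
optional sanity check using the ``transformers`` library; the model
path is the one assumed in the script above.

.. code:: python

   import os

   from transformers import AutoModelForCausalLM, AutoTokenizer

   # Same path as in run_deepseek7b_llm.sh; adjust to your environment.
   model_path = os.path.expanduser("~/models/deepseek-llm-7b-chat")

   tokenizer = AutoTokenizer.from_pretrained(model_path)
   model = AutoModelForCausalLM.from_pretrained(model_path)
   print(f"Loaded {model.config.model_type} with "
         f"{model.num_parameters():,} parameters")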