DeepRetrieval - Hacking Search Engines & Retrievers with LLM + RL

Let LLMs learn how to search!

Preliminary Technical Report (arXiv preprint)

Installation

conda create -n zero python=3.9
# install torch [or skip this step and let vllm install the correct version for you]
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
# install vllm
pip3 install vllm==0.6.3 # or you can install 0.5.4, 0.4.2 and 0.3.1
pip3 install ray

# verl
cd code
pip install -e .

# flash attention 2
pip3 install flash-attn --no-build-isolation
# quality of life
pip install wandb IPython matplotlib

Get started

cd code

1. Data Preparation (required)

For example, for PubMed:

conda activate zero
python data_preprocess/pubmed.py

2. Get Your Search Engine API Key (required if you use a search engine)

For example, for PubMed, you can obtain a key by following the instructions here.

Then save it under code/verl/utils/reward_score/apis/ as pubmed_api.key.
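To check that the key file is in place before launching training, a small loader like the one below can help. This is only a sketch: `load_api_key` and `DEFAULT_KEY_PATH` are hypothetical names used for illustration, the path is taken from the instruction above (relative to the repository root), and the repository's own reward code may read the key differently.

```python
from pathlib import Path

# Assumed location, copied from the instruction above (relative to the repo root).
DEFAULT_KEY_PATH = Path("code/verl/utils/reward_score/apis/pubmed_api.key")


def load_api_key(path: Path = DEFAULT_KEY_PATH) -> str:
    """Return the API key stored in `path`, stripped of surrounding whitespace.

    Hypothetical helper for illustration; not part of the repository.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"API key file not found: {path}")
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"API key file is empty: {path}")
    return key
```

Calling `load_api_key()` from the repository root raises a clear error if the file is missing or empty, which is easier to diagnose than a failed search request in the middle of training.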

3. Reward Function (optional)

Reward Design (e.g., in code/verl/utils/reward_score/pubmed.py):

Recall	≥ 0.7	≥ 0.5	≥ 0.4	≥ 0.3	≥ 0.1	≥ 0.05	< 0.05
Reward	+5.0	+4.0	+3.0	+1.0	+0.5	+0.1	-3.5
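The tiers in the table can be read as a step function over recall. The sketch below is a minimal illustration of that mapping, assuming recall is the fraction of relevant documents that the generated query retrieves; `recall`, `REWARD_TIERS`, and `recall_to_reward` are illustrative names, and the actual code in pubmed.py may include additional terms.

```python
def recall(retrieved_ids, relevant_ids):
    """Fraction of relevant documents that appear among the retrieved ones."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


# (recall threshold, reward) pairs from the table, highest threshold first.
REWARD_TIERS = [
    (0.7, 5.0),
    (0.5, 4.0),
    (0.4, 3.0),
    (0.3, 1.0),
    (0.1, 0.5),
    (0.05, 0.1),
]
LOW_RECALL_REWARD = -3.5  # applied when recall < 0.05


def recall_to_reward(r):
    """Map a recall value in [0, 1] to the tiered reward from the table."""
    for threshold, reward in REWARD_TIERS:
        if r >= threshold:
            return reward
    return LOW_RECALL_REWARD
```

For example, a query that retrieves one of two relevant documents has recall 0.5 and falls in the +4.0 tier, while a query that retrieves none of them receives the -3.5 penalty.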

4. Customize Monitor Info (optional)

Modify compute_reward_metrics() in code/verl/trainer/ppo/ray_trainer.py.
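The signature of compute_reward_metrics() is not shown here, so the following is only a standalone sketch of the kind of summary one might log per batch. `summarize_rewards`, its parameters, and the metric names are all hypothetical; adapt them to whatever the function in ray_trainer.py actually receives and returns.

```python
from statistics import mean


def summarize_rewards(rewards, top_tier_reward=5.0):
    """Summarize a batch of scalar rewards as a flat dict of metrics.

    Hypothetical helper: `top_tier_reward` defaults to the highest tier
    in the reward table (+5.0, i.e. recall >= 0.7).
    """
    rewards = list(rewards)
    if not rewards:
        return {}
    return {
        "reward/mean": mean(rewards),
        "reward/max": max(rewards),
        "reward/min": min(rewards),
        # Share of samples that reached the highest reward tier.
        "reward/top_tier_ratio": sum(r >= top_tier_reward for r in rewards) / len(rewards),
    }
```

A flat dictionary with slash-separated keys is a common shape for experiment trackers such as wandb, which groups metrics by the prefix before the slash.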

Run Training

conda activate zero

If you run out of GPU memory (VRAM) when running the script below, try adding critic.model.enable_gradient_checkpointing=True to it.

For example, for PubMed:

sh scripts/train/pubmed_train.sh

Reward Curve During Training

Run Evaluation

sh scripts/eval/pubmed_test.sh

Result (checkpoint date: Feb 16)

Model	Method	Recall (Publication)	Recall (Trial)
GPT-4o	Zero-shot	5.79	6.74
	Few-shot	7.67	4.69
	ICL	19.72	14.26
	ICL+Few-shot	11.95	7.98
GPT-3.5	Zero-shot	4.01	3.37
	Few-shot	4.15	3.34
	ICL	18.68	13.94
	ICL+Few-shot	7.06	5.54
Haiku-3	Zero-shot	10.98	11.59
	Few-shot	14.71	7.47
	ICL	20.92	24.68
	ICL+Few-shot	19.11	9.27
Mistral-7B	Zero-shot	7.18	8.08
LEADS$^{*}$	Zero-shot	24.68	32.11
DeepRetrieval	Zero-shot	60.82	70.84

Table: Comparison of different models and methods on publication search and trial search tasks. Bold numbers indicate the best performance.

$^{*}$ LEADS: a state-of-the-art literature mining LLM trained on 20K reviews and 400K publications [https://arxiv.org/pdf/2501.16255]

Acknowledgements

This implementation is mainly based on verl. The base model used in the experiments is Qwen2.5-3B. We sincerely appreciate their contributions to the open-source community.

Cite DeepRetrieval

Current version (will update the author list upon project completion):

@misc{jiang2025deepretrievalpowerfulquerygeneration,
      title={DeepRetrieval: Powerful Query Generation for Information Retrieval with Reinforcement Learning}, 
      author={Pengcheng Jiang},
      year={2025},
      eprint={2503.00223},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2503.00223}, 
}

Thanks for your interest! 😊
