Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search

This is the code repository for our paper "Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search" (arXiv: https://arxiv.org/abs/2502.12974). Model checkpoints are linked in the Evaluation section below.

DEBATER is a novel framework that introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations through a continuous chain of thought. To consolidate information from multiple thinking steps, DEBATER incorporates the Self-Distillation mechanism, which identifies the most informative steps and integrates them into a unified text embedding.

(Figure: overview of the DEBATER model.)
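
To make the two mechanisms concrete, here is a minimal conceptual sketch in Python. It is illustrative only, not the released implementation: the consensus-based weighting below is an assumed stand-in for the paper's self-distillation objective, and all tensor names are made up for this example.

# Conceptual sketch of fusing Chain-of-Deliberation steps into one
# embedding (illustrative only; the consensus-based weighting is a
# simplified stand-in for the paper's self-distillation objective).
import torch
import torch.nn.functional as F

def fuse_deliberation_steps(step_embeddings: torch.Tensor) -> torch.Tensor:
    """Fuse per-step representations into one text embedding.

    step_embeddings: (num_steps, dim) -- one embedding per
    deliberation step appended after the input text.
    """
    # Score each step by its agreement with the consensus of all
    # steps, then upweight the more informative steps.
    consensus = step_embeddings.mean(dim=0, keepdim=True)         # (1, dim)
    scores = F.cosine_similarity(step_embeddings, consensus)      # (num_steps,)
    weights = torch.softmax(scores, dim=0)                        # (num_steps,)
    fused = (weights.unsqueeze(-1) * step_embeddings).sum(dim=0)  # (dim,)
    return F.normalize(fused, dim=-1)

# Example: 4 thinking steps, 128-dimensional embeddings.
steps = torch.randn(4, 128)
print(fuse_deliberation_steps(steps).shape)  # torch.Size([128])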

Installation

pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Training

We use the dataset from Repetition Improves Language Model Embeddings, which can be downloaded from their GitHub repository. After downloading, place it in the data folder:

data
└── echo-data
    ├── allnli.jsonl
    ├── dureader.jsonl
    ...
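
As a quick sanity check that the data is in place, you can count the examples per file. This small helper is written for this README and is not part of the repo:

# List each echo-data file and its number of examples
# (one JSON object per line in a .jsonl file).
from pathlib import Path

data_dir = Path("data/echo-data")
for path in sorted(data_dir.glob("*.jsonl")):
    with path.open() as f:
        num_examples = sum(1 for _ in f)
    print(f"{path.name}: {num_examples} examples")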

To train the MiniCPM model, you can run the following script:

torchrun --nproc_per_node=4 scripts/run_supervised.py train_configs/Minicpm2-2B.json

or:

torchrun --nproc_per_node=4 scripts/run_supervised.py train_configs/Minicpm3-4B.json

Please modify the configuration files in the train_configs folder as needed. For example:

{
    "model_name_or_path": "xxxx", 
    "peft_model_name_or_path": null,
    "pooling_mode": "last_token",
    "dataset_name": "E5",
    "dataset_file_path": "xxxx/data/echo-data",
    "remove_unused_columns": false,
    "learning_rate": 2e-4,
    "num_train_epochs": 1,
    "warmup_steps": 300,
    "per_device_train_batch_size": 64,
    "per_device_eval_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "do_train": true,
    "disable_tqdm": false,
    "max_seq_length": 512,
    "overwrite_output_dir": true,
    "output_dir": "xxxxxx", 
    "logging_steps": 50,
    "save_steps": 200,
    "save_only_model": true,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "seed": 42
}

The main parameters to modify are:

"model_name_or_path": "xxxx", # Path to the MiniCPM model
"dataset_file_path": "xxxx", # Path to the training data
"output_dir": "xxxx" # Path to the output directory

Evaluation

To use the trained model for evaluation, you need to modify the mteb package file at your_env_site-packages/mteb/models/llm2vec_models.py and register the trained model at the end of the file:

llm2vec_MiniCPM2B = ModelMeta(
    loader=_loader(
        LLM2VecWrapper,
        base_model_name_or_path="xxxxx", # Base MiniCPM Model
        peft_model_name_or_path="xxxxx", # Trained lora parameters
        device_map="auto",
        torch_dtype=torch.bfloat16,
    ),
    name="xxxxxx",  # Custom Name
    languages=["eng_Latn"],
    open_source=True,
    revision=None,
    release_date="2025-01-02",
)

We provide a sample llm2vec_models.py file for reference. Run the following script to evaluate, taking ArguAna as an example:

python mteb_eval.py
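
For reference, a minimal sketch of what such an evaluation driver can look like with a recent mteb version; the model name is the custom name you registered above, and the output folder is illustrative.

# Sketch of an mteb evaluation driver on ArguAna
# (assumes a recent mteb version with get_model/get_tasks).
import mteb

# Use the custom name registered in llm2vec_models.py (placeholder).
model = mteb.get_model("your-custom-model-name")
tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/ArguAna")
print(results)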

The checkpoints can be obtained at the following addresses:

(1) The checkpoint of the DEBATER-2B is here.

(2) The checkpoint of the DEBATER-4B is here.
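
Assuming the released checkpoints are LoRA adapters on top of the base MiniCPM model (as the peft_model_name_or_path field above suggests), loading one would look roughly like this; the paths are placeholders:

# Sketch of loading a DEBATER checkpoint as a LoRA adapter
# (assumption: the checkpoint is a PEFT adapter; paths are placeholders).
import torch
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

base_path = "xxxx"     # Base MiniCPM model
adapter_path = "xxxx"  # Downloaded DEBATER checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_path, trust_remote_code=True)
base = AutoModel.from_pretrained(base_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, adapter_path)
model.eval()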

Citation

If you find our work helpful to your research, please acknowledge our contributions by citing us in your publications or projects:

@misc{ji2025learningeffectiverepresentationsdense,
      title={Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search}, 
      author={Yifan Ji and Zhipeng Xu and Zhenghao Liu and Yukun Yan and Shi Yu and Yishan Li and Zhiyuan Liu and Yu Gu and Ge Yu and Maosong Sun},
      year={2025},
      eprint={2502.12974},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2502.12974}, 
}

Contact

If you have questions, suggestions, or bug reports, please email:

bigtailwolf001@gmail.com
