Learning-to-Rank

Done as part of the CS-572: Information Retrieval course, instructed by Dr. Eugene Agichtein.

Developed and tested on Ubuntu 22.04.2 LTS.

Contents

  1. Background Information
  2. Implementation Details
  3. Try it yourself

1. Background Information

1.1 Dataset

MSLR-WEB10K is a machine-learning dataset in which queries and URLs are represented by IDs. It consists of feature vectors extracted from query-URL pairs, along with relevance judgment labels.
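
Each row of the dataset is in LETOR/SVMlight format: a relevance label, a query ID, and 136 numbered feature values. The row below is illustrative only (the feature values are made up):

    2 qid:10 1:3 2:0 3:2 ... 135:0.021 136:0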

1.2 Implemented Ranking Methods

  • MART
  • LambdaMART
  • XGBoost

The MART and LambdaMART implementations are from RankLib: https://github.com/codelibs/ranklib or https://sourceforge.net/p/lemur/wiki/RankLib/

XGBoost: https://github.com/dmlc/xgboost

1.3 Dataset Variations

The models are trained and evaluated on the five provided folds, using three variations of the MSLR-WEB10K dataset.

  • Setting 1:

    Supervised ranking with content features only (features 1-133 in the dataset).

  • Setting 2:

    Supervised ranking with the full feature set, including the behavior features (features 134-136).

  • Setting 3:

    Partially supervised ranking using click features as weak labels. Derive noisy/weak relevance labels using click features (click count, dwell time), and use those derived labels to train the three models.

    Example labeling heuristic: if a URL receives more than 0.5 of all clicks for a query and its average dwell time is greater than 10 seconds, label it relevant (see the sketch after this list).

    Experiment with other ideas for inferring relevance from the (limited) click data available. For example, consider learning to predict the relevance label from click data features.
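
A minimal sketch of the example heuristic above, assuming the per-URL click fraction and average dwell time have already been aggregated for each query-URL pair (the function and argument names are hypothetical):

```python
def weak_label(click_fraction, avg_dwell_seconds):
    """Example heuristic: a URL is relevant if it received more than half of
    the query's clicks and was viewed for more than 10 seconds on average."""
    return 1 if (click_fraction > 0.5 and avg_dwell_seconds > 10) else 0
```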

1.4 Evaluation Metrics

  • NDCG@3
  • NDCG@5
  • NDCG@10
  • MAP

2. Implementation Details

2.1 Models

2.1.1 MART and LambdaMART

Both MART and LambdaMART were implemented by invoking RankLib-2.18.jar from a Python script with RankLib's default parameters. The parameters used are:

| Parameter | Value |
| --- | --- |
| Metric to optimize on training data (metric2t) | NDCG@10 |
| Number of trees | 100 |
| Number of leaves per tree | 10 |
| Learning rate | 0.1 |
| Number of threshold candidates for tree splitting | 256 |
| Minimum number of samples per leaf | 1 |
| Early stopping rounds on validation | 100 |

The models were evaluated on the required metrics using the jar file as well.

Sample command to train a model and save it:

$ java -jar RankLib-2.18.jar -train MSLR-WEB10K/Fold<fold>/train.txt -validate MSLR-WEB10K/Fold<fold>/vali.txt -ranker <ranker> -metric2t NDCG@10 -save <model_save_path>

<ranker> takes the value 0 for MART and 6 for LambdaMART.

NOTE: To train a model on a subset of features, create a feature subset file that lists the feature IDs to be considered by the learner, one per line, and pass the argument -feature <feature_subset_file> to the command.
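
For example, a feature subset file for Setting 1 (content features only) could be generated with a few lines of Python (the output file name is an assumption):

```python
# Write the IDs of the content features (1-133), one per line, in the
# plain-text format expected by RankLib's -feature option.
with open("content_features.txt", "w") as f:   # file name is hypothetical
    for feature_id in range(1, 134):
        f.write(f"{feature_id}\n")
```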

Sample command to evaluate a pre-trained model and save the model performance:

$ java -jar RankLib-2.18.jar -load <model_save_path> -test MSLR-WEB10K/Fold<fold>/test.txt -metric2T <test_metric> -idv <score_save_path>

For more information on usage, see The Lemur Project / Wiki / How to Use.
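
For reference, here is a minimal sketch of how such a command can be issued from a Python script via subprocess. The fold, ranker, and save path are placeholders that follow the conventions in this README; this is not the exact code in run.py.

```python
import subprocess

fold = 1  # placeholder
cmd = [
    "java", "-jar", "RankLib-2.18.jar",
    "-train", f"MSLR-WEB10K/Fold{fold}/train.txt",
    "-validate", f"MSLR-WEB10K/Fold{fold}/vali.txt",
    "-ranker", "6",            # 0 = MART, 6 = LambdaMART
    "-metric2t", "NDCG@10",
    "-save", f"models/all_features/LambdaMART/LambdaMART_model_{fold}.txt",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if RankLib fails
```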

2.1.2 XGBoost

XGBoost was implemented using its Python package. The model xgboost.XGBRanker was used with the following parameters:

| Parameter | Value |
| --- | --- |
| Number of gradient boosted trees (n_estimators) | 5 |
| Maximum tree depth for base learners (max_depth) | 5 |
| Learning objective (objective) | rank:ndcg |
| Metric used for monitoring training (eval_metric) | ndcg |

The models were evaluated using a custom function to compute the NDCG values; MAP was not evaluated for these models.
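
Below is a minimal sketch of training and scoring xgboost.XGBRanker on one fold. It assumes scikit-learn's load_svmlight_file for reading the LETOR files and a reasonably recent xgboost (eval_metric in the constructor needs >= 1.6, and fit accepts qid=); sklearn.metrics.ndcg_score stands in for the repo's custom NDCG function.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import ndcg_score

fold = 1  # placeholder

def load_split(split):
    X, y, qid = load_svmlight_file(f"MSLR-WEB10K/Fold{fold}/{split}.txt",
                                   query_id=True)
    order = np.argsort(qid, kind="stable")  # xgboost expects rows grouped by qid
    return X[order], y[order], qid[order]

X_train, y_train, qid_train = load_split("train")
X_test, y_test, qid_test = load_split("test")

model = xgb.XGBRanker(n_estimators=5, max_depth=5,
                      objective="rank:ndcg", eval_metric="ndcg")
model.fit(X_train, y_train, qid=qid_train)

# Average per-query NDCG@10 on the test split.
scores = model.predict(X_test)
ndcg10 = np.mean([ndcg_score([y_test[qid_test == q]],
                             [scores[qid_test == q]], k=10)
                  for q in np.unique(qid_test)])
print(f"Fold {fold} NDCG@10: {ndcg10:.4f}")
```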

2.2 Creating Weak Relevance Labels from Click Data

Weak relevance labels for each fold were created using sklearn.ensemble.RandomForestClassifier. For each fold, a classifier was trained on only the click features of the training data. New train, validation, and test datasets were then created, in the same format as the MSLR-WEB10K data, with relevance labels predicted by that classifier from the click features. The parameters used for the model are {max_depth=5, n_estimators=25}.
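
A minimal sketch of this step for a single fold is shown below. It assumes the click features are the last three columns (features 134-136, 0-based columns 133-135) and uses scikit-learn's SVMlight reader/writer; the actual code in modules/weak_labeler.py may differ.

```python
from sklearn.datasets import dump_svmlight_file, load_svmlight_file
from sklearn.ensemble import RandomForestClassifier

fold = 1                      # placeholder
CLICK_COLS = slice(133, 136)  # 0-based columns for features 134-136

# Train the classifier on the click features only, with the original
# relevance judgments of the training split as targets.
X_train, y_train, _ = load_svmlight_file(
    f"MSLR-WEB10K/Fold{fold}/train.txt", query_id=True)
clf = RandomForestClassifier(max_depth=5, n_estimators=25)
clf.fit(X_train[:, CLICK_COLS].toarray(), y_train)

# Replace the labels of every split with the classifier's predictions and
# write the result back out in the same 1-based LETOR format.
for split in ("train", "vali", "test"):
    X, _, qid = load_svmlight_file(
        f"MSLR-WEB10K/Fold{fold}/{split}.txt", query_id=True)
    weak_y = clf.predict(X[:, CLICK_COLS].toarray())
    dump_svmlight_file(X, weak_y, f"MSLR-WEB10K/Fold{fold}/weak_{split}.txt",
                       zero_based=False, query_id=qid)
```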

2.3 Files and Directories

  • run.py - driver program that calls required methods
  • modules/ - has all the required classes
  • Fetch the MSLR-WEB10K data from here and place it in the root directory of this repo, so that a dataset can be accessed from the path MSLR-WEB10K/Fold<fold>/<dataset>.txt
  • Requirements can be found in requirements.txt
  • model_scores.pdf - sample set of scores averaged across folds for each model and setting
  • Models generated will be saved to the path models/<setting>/<ranker>/<ranker>_model_<fold>.txt
  • Scores will be saved to the path scores/<setting>/<ranker>/<metric>/<ranker>_score_<fold>_<dataset>_<metric>.txt

<setting> can be one of content_features_only / all_features / weak_labels

<ranker> can be one of MART / LambdaMART / XGBoost

<fold> can be one of 1 / 2 / 3 / 4 / 5

<metric> can be one of NDCG@3 / NDCG@5 / NDCG@10 / MAP

<dataset> can be one of train / vali / test

2.4 Code Flow

  • Initializes empty directories to save scores and models
  • Creates the feature subset file containing only the content features
  • Creates weak-labeled data as mentioned in Setting 3 in Dataset Variations for each fold and saves the new datasets to the path MSLR-WEB10K/Fold<fold>/weak_<dataset>.txt
  • Creates the models and evaluates their performances for the required metrics
  • Prints the final scores averaged across the five folds for each model

3. Try it yourself

3.1 Installing Requirements

From the root of the repo:

  • Create a virtual environment if you'd like and activate it:

    $ virtualenv -p python3 .venv
    $ source .venv/bin/activate
    
  • Install the requirements:

    $ pip3 install -r requirements.txt
    
  • Download the MSLR-WEB10K data and the RankLib-2.18 jar to the root of the repo

3.2 Run the program

  • Run the program
    $ python3 run.py
    

You can try creating your own weak labels by modifying the code in modules/weak_labeler.py.
