Done as part of CS-572: Information Retrieval course instructed by Dr. Eugene Agichtein.
Developed and tested on Ubuntu 22.04.2 LTS.
MSLR-WEB10K - a learning-to-rank dataset in which queries and URLs are represented by IDs. Each entry is a feature vector extracted from a query-URL pair along with a relevance judgment label.
- MART
- LambdaMART
- XGBoost

MART, LambdaMART: implementation from RankLib (https://github.com/codelibs/ranklib or https://sourceforge.net/p/lemur/wiki/RankLib/)
XGBoost: https://github.com/dmlc/xgboost
The models are trained and used for prediction on the provided 5 folds, for 3 variations of the MSLR-WEB10K dataset.
- Setting 1: Supervised ranking with content features only (features 1-133 in the dataset).
- Setting 2: Supervised ranking with the full feature set, including the behavior features (134-136).
- Setting 3: Partially supervised ranking using click features as weak labels. Derive noisy/weak relevance labels from the click features (click count, dwell time), and use those derived labels to train the three models. Example labeling heuristic: a URL receives > 0.5 of all clicks for the query and its average dwell time is > 10 seconds => relevant (a sketch of this heuristic follows the list). Experiment with other ideas for inferring relevance from the (limited) click data available; for example, consider learning to predict the relevance label from click-data features.
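A minimal sketch of the example heuristic, assuming the click features have been pulled into a pandas DataFrame with hypothetical columns `query_id`, `click_count`, and `dwell_time` (the column names are illustrative, not the dataset's actual feature indices):

```python
import pandas as pd

def heuristic_weak_labels(df: pd.DataFrame) -> pd.Series:
    """Derive binary weak relevance labels from click features.

    Assumes one row per query-URL pair with hypothetical columns
    'query_id', 'click_count', and 'dwell_time'.
    """
    # Fraction of the query's total clicks that went to each URL.
    total_clicks = df.groupby("query_id")["click_count"].transform("sum")
    click_fraction = df["click_count"] / total_clicks.clip(lower=1)

    # Heuristic: > 0.5 of the query's clicks and average dwell time
    # above 10 seconds => relevant (1), otherwise not relevant (0).
    return ((click_fraction > 0.5) & (df["dwell_time"] > 10)).astype(int)
```

The resulting labels would then stand in for the original relevance judgments when the weak-labeled files are written out.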
The models are evaluated on the following metrics:
- NDCG@3
- NDCG@5
- NDCG@10
- MAP
Both MART and LambdaMART were implemented by invoking `RankLib-2.18.jar` from a Python script using RankLib's default parameters. The parameters used are:
| Parameter | Value |
|---|---|
| Metric to optimize on training data, `metric2t` | NDCG@10 |
| Number of trees | 100 |
| Number of leaves for each tree | 10 |
| Learning rate | 0.1 |
| Number of threshold candidates for tree splitting | 256 |
| Minimum # samples each leaf has to contain | 1 |
| Early stopping rounds on validation | 100 |
The models were evaluated on the required metrics using the jar file as well.
Sample command to train a model and save it:
$ java -jar RankLib-2.18.jar -train MSLR-WEB10K/Fold<fold>/train.txt -validate MSLR-WEB10K/Fold<fold>/vali.txt -ranker <ranker> -metric2t NDCG@10 -save <model_save_path>
`<ranker>` takes the value 0 for MART and 2 for LambdaMART.
NOTE: To train a model on a subset of features, create a feature subset file listing the features to be considered by the learner, one per line, and provide the argument `-feature <feature_subset_file>` to the command (see the snippet below).
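For instance, the content-features-only file for Setting 1 could be generated with a few lines of Python (the file name `content_features.txt` is just an example):

```python
# Write the content feature IDs (1-133) one per line, for RankLib's -feature flag.
with open("content_features.txt", "w") as f:
    for feature_id in range(1, 134):
        f.write(f"{feature_id}\n")
```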
Sample command to evaluate a pre-trained model and save the model performance:
$ java -jar RankLib-2.18.jar -load <model_save_path> -test MSLR-WEB10K/Fold<fold>/test.txt -metric2T <test_metric> -idv <score_save_path>
For more information on how to use RankLib, see The Lemur Project / Wiki / How to Use.
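A rough sketch of how the jar can be invoked from Python via `subprocess`, mirroring the commands above (the helper names and default paths are illustrative, not the repository's actual code):

```python
import subprocess
from typing import Optional

RANKLIB_JAR = "RankLib-2.18.jar"  # assumed to sit in the repo root

def train_ranklib(fold: int, ranker: int, model_path: str,
                  feature_file: Optional[str] = None) -> None:
    """Train MART (ranker=0) or LambdaMART (ranker=2) on one fold."""
    cmd = [
        "java", "-jar", RANKLIB_JAR,
        "-train", f"MSLR-WEB10K/Fold{fold}/train.txt",
        "-validate", f"MSLR-WEB10K/Fold{fold}/vali.txt",
        "-ranker", str(ranker),
        "-metric2t", "NDCG@10",
        "-save", model_path,
    ]
    if feature_file:  # e.g. content features only (Setting 1)
        cmd += ["-feature", feature_file]
    subprocess.run(cmd, check=True)

def evaluate_ranklib(fold: int, model_path: str, metric: str, score_path: str) -> None:
    """Evaluate a saved model on the fold's test split and save per-query scores."""
    subprocess.run([
        "java", "-jar", RANKLIB_JAR,
        "-load", model_path,
        "-test", f"MSLR-WEB10K/Fold{fold}/test.txt",
        "-metric2T", metric,
        "-idv", score_path,
    ], check=True)
```

For example, `train_ranklib(1, 2, "models/all_features/LambdaMART/LambdaMART_model_1.txt")` would mirror the training command above for LambdaMART on Fold 1.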
XGBoost was implemented using its Python package, with the model `xgboost.XGBRanker` and the following parameters:
| Parameter | Value |
|---|---|
| Number of gradient boosted trees, `n_estimators` | 5 |
| Maximum tree depth for base learners, `max_depth` | 5 |
| Learning objective, `objective` | rank:ndcg |
| Metric used for monitoring the training result, `eval_metric` | ndcg |
The XGBoost models were evaluated using a custom function to compute the NDCG values; MAP was not evaluated for these models.
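A minimal sketch of fitting `XGBRanker` with these parameters on one fold, assuming the LETOR-format files are read with scikit-learn's `load_svmlight_file` (how the repository actually loads the data may differ):

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_svmlight_file

def load_fold_split(path: str):
    """Load a LETOR-format split: features, labels, and per-query group sizes."""
    X, y, qid = load_svmlight_file(path, query_id=True, n_features=136)
    # Group sizes in the order queries appear in the file
    # (each query's documents are listed contiguously).
    _, first_idx, counts = np.unique(qid, return_index=True, return_counts=True)
    return X, y, counts[np.argsort(first_idx)]

X_train, y_train, group_train = load_fold_split("MSLR-WEB10K/Fold1/train.txt")
X_vali, y_vali, group_vali = load_fold_split("MSLR-WEB10K/Fold1/vali.txt")

ranker = xgb.XGBRanker(
    n_estimators=5,         # number of gradient boosted trees
    max_depth=5,            # maximum tree depth for base learners
    objective="rank:ndcg",  # learning objective
    eval_metric="ndcg",     # metric monitored during training
)
ranker.fit(
    X_train, y_train, group=group_train,
    eval_set=[(X_vali, y_vali)], eval_group=[group_vali],
    verbose=False,
)
scores = ranker.predict(X_vali)  # per-document scores used to rank within each query
```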
Weak relevance labels for each fold were created using `sklearn.ensemble.RandomForestClassifier`. For each fold, a classifier was trained using only the click features from the training data. New train, validation, and test datasets were then created in the same format as the MSLR-WEB10K data, with relevance labels predicted by this classifier from the click features. The parameters used for the classifier are `{max_depth=5, n_estimators=25}`. A sketch of this step is shown below.
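A rough sketch of that step for a single fold, again assuming `load_svmlight_file` for reading the LETOR files; the click-feature column indices and helper name are assumptions, and the repository's actual implementation lives in `modules/weak_labeler.py`:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier

# Hypothetical 0-based column indices of the behavior/click features
# (documented as features 134-136 in MSLR-WEB10K).
CLICK_COLUMNS = [133, 134, 135]

def weak_label_fold(train_path: str, target_paths: list[str]) -> list[np.ndarray]:
    """Fit a classifier on the training split's click features and predict
    weak relevance labels for each split of the same fold."""
    X_train, y_train, _ = load_svmlight_file(train_path, query_id=True, n_features=136)
    clf = RandomForestClassifier(max_depth=5, n_estimators=25)
    clf.fit(X_train[:, CLICK_COLUMNS].toarray(), y_train)

    weak_labels = []
    for path in target_paths:
        X, _, _ = load_svmlight_file(path, query_id=True, n_features=136)
        # These predictions replace the original relevance labels when the
        # weak_<dataset>.txt files are written back out in LETOR format.
        weak_labels.append(clf.predict(X[:, CLICK_COLUMNS].toarray()))
    return weak_labels
```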
- `run.py` - driver program that calls the required methods
- `modules/` - has all the required classes
- Fetch the MSLR-WEB10K data from here and place it in the root directory of this repo, so that a dataset can be accessed from the path `MSLR-WEB10K/Fold<fold>/<dataset>.txt`
- Requirements can be found in `requirements.txt`
- `model_scores.pdf` - sample set of scores averaged across folds for each model and setting
- Models generated will be saved to the path `models/<setting>/<ranker>/<ranker>_model_<fold>.txt`
- Scores will be saved to the path `scores/<setting>/<ranker>/<metric>/<ranker>_score_<fold>_<dataset>_<metric>.txt`
- `<setting>` can be one of content_features_only / all_features / weak_labels
- `<ranker>` can be one of MART / LambdaMART / XGBoost
- `<fold>` can be one of 1 / 2 / 3 / 4 / 5
- `<metric>` can be one of NDCG@3 / NDCG@5 / NDCG@10 / MAP
- `<dataset>` can be one of train / vali / test
The driver program performs the following steps:
- Initializes empty directories to save scores and models
- Creates the feature subset file containing only the content features
- Creates weak-labeled data as mentioned in Setting 3 of Dataset Variations for each fold and saves the new datasets to the path `MSLR-WEB10K/Fold<fold>/weak_<dataset>.txt`
- Creates the models and evaluates their performances for the required metrics
- Prints the final scores averaged across the five folds for each model
From the root of the repo:
- Create a virtual environment if you'd like and activate it:
  $ virtualenv -p python3 .venv
  $ source .venv/bin/activate
- Install the requirements:
  $ pip3 install -r requirements.txt
- Download the MSLR-WEB10K data and the RankLib-2.18 jar to the root of the repo
- Run the program:
  $ python3 run.py
You can try creating your own weak labels by modifying the code in `modules/weak_labeler.py`.