A simple formulation, and its implementation, for selecting the best top-k documents for a query, optimizing both precision and diversity.
Given a set of documents D and a set of queries Q, the goal of Learning to Rank (L2R) is to learn a model that ranks D (and any other documents) for any query. In the "classic" version we are concerned only with precision; here we add another variable, diversity.
So we try to optimize the F-score between precision and diversity, where diversity is defined as the number of distinct types found among the top k ranked documents.
The precision of each document can be predicted with any L2R model (Random Forest, SVM, LambdaMART, etc.), and the types of each document can be assigned by any method you want.
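The text above does not pin down the objective exactly, so here is one natural reading as a minimal sketch: the precision of a selected set is the mean of its documents' predicted precisions, diversity is the number of distinct types covered normalized by m (so both terms lie in [0, 1]), and the F-score is their harmonic mean. The function name and the normalization by m are assumptions, not something fixed by the formulation.

```python
def f_score(precisions, types, selected, m):
    """F1 (harmonic mean) of set precision and normalized diversity.

    precisions: list of per-document precision predictions.
    types: list of lists, types[i] holds the types of document i.
    selected: indices of the chosen documents.
    m: total number of document types (assumed normalizer for diversity).
    """
    # Set precision: mean predicted precision of the selected documents.
    p = sum(precisions[i] for i in selected) / len(selected)
    # Diversity: distinct types covered, normalized by the m known types.
    covered = set()
    for i in selected:
        covered.update(types[i])
    d = len(covered) / m
    if p + d == 0:
        return 0.0
    return 2 * p * d / (p + d)
```

For the example input below, picking documents 2 and 3 gives p = 0.8 and d = 2/3, so the F-score is 8/11 ≈ 0.727.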
The input begins with three integers: n, m and k.
n is the number of documents.
m is the number of document types.
k is the number of documents to select.
Then n lines follow:
each contains a real number p, the precision of the i-th document.
Another n lines follow:
each contains an integer x, the number of types assigned to the i-th document,
followed by x integers, the types themselves.
Everything here is 0-based.
Input Example:
5 3 2
0.1
0.5
0.7
0.9
0.3
1 1
3 0 1 2
2 1 2
0
1 0
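To make the format concrete, here is a sketch that parses the example input above and exhaustively scores every k-subset with the F-score between set precision and normalized diversity. The exhaustive search is an assumption for illustration (fine for small n); the text does not fix a particular selection algorithm, and the helper names are mine.

```python
from io import StringIO
from itertools import combinations

EXAMPLE = """5 3 2
0.1
0.5
0.7
0.9
0.3
1 1
3 0 1 2
2 1 2
0
1 0
"""

def parse(stream):
    # First line: n documents, m types, k documents to select.
    n, m, k = map(int, stream.readline().split())
    # Next n lines: one precision value per document.
    precisions = [float(stream.readline()) for _ in range(n)]
    # Next n lines: a count x followed by x type ids.
    types = []
    for _ in range(n):
        parts = list(map(int, stream.readline().split()))
        types.append(parts[1:])  # drop the leading count x
    return n, m, k, precisions, types

def best_subset(n, m, k, precisions, types):
    # Brute-force search over all C(n, k) subsets (illustrative only).
    def f(sel):
        p = sum(precisions[i] for i in sel) / k          # mean precision
        d = len({t for i in sel for t in types[i]}) / m  # normalized diversity
        return 2 * p * d / (p + d) if p + d else 0.0
    return max(combinations(range(n), k), key=f)

n, m, k, precisions, types = parse(StringIO(EXAMPLE))
print(best_subset(n, m, k, precisions, types))  # prints (1, 3)
```

On this example, documents 1 and 3 win: document 3 has the highest precision (0.9) and document 1 alone covers all three types, giving F = 1.4/1.7 ≈ 0.824, which beats the pure-precision pick (2, 3).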