= pd.read_csv("https://data.vespa.oath.cloud/blog/ranking/train_sample.csv") train_df
Learning to rank
This notebook is WIP and not runnable - ToDo FIXME
Data
This section describes the data that we are going to use to give a brief overview of the pyvespa ranking framework. The data was collected from a running Vespa application indexed with MS MARCO data. For each relevant (document_id, query_id)-pair we collected 9 random matched documents. Relevant documents have label=1 and non-relevant documents have label=0. In addition, many Vespa ranking features computed from the document and query interaction are included.
The data used here is a sample containing 100,000 rows and 71 ranking features (74 columns in total, including document_id, query_id and label).
import pandas as pd

train_df = pd.read_csv("https://data.vespa.oath.cloud/blog/ranking/train_sample.csv")
train_df.shape
(100000, 74)
train_df.head(10)
  | document_id | query_id | label | elementCompleteness(body).completeness | elementCompleteness(body).fieldCompleteness | elementCompleteness(body).queryCompleteness | fieldMatch(body) | fieldMatch(body).absoluteOccurrence | fieldMatch(body).absoluteProximity | fieldMatch(body).completeness | ... | term(3).significance | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(body).fieldCoverage | textSimilarity(body).order | textSimilarity(body).proximity | textSimilarity(body).queryCoverage | textSimilarity(body).score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 27061 | 3 | 0 | 0.358796 | 0.092593 | 0.625 | 0.127746 | 0.022000 | 0.02600 | 0.598380 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.092593 | 0.250000 | 0.437500 | 0.625 | 0.396644 |
1 | 257 | 3 | 0 | 0.359670 | 0.094340 | 0.625 | 0.092319 | 0.018000 | 0.03500 | 0.598467 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.094340 | 0.750000 | 0.234375 | 0.625 | 0.400899 |
2 | 363 | 3 | 0 | 0.277397 | 0.054795 | 0.500 | 0.141511 | 0.030000 | 0.07100 | 0.477740 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.054795 | 0.666667 | 0.640625 | 0.500 | 0.485178 |
3 | 22682 | 3 | 0 | 0.333686 | 0.042373 | 0.625 | 0.250817 | 0.056000 | 0.10000 | 0.595869 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.042373 | 0.250000 | 0.324219 | 0.625 | 0.346951 |
4 | 160 | 3 | 0 | 0.295455 | 0.090909 | 0.500 | 0.118351 | 0.015000 | 0.05000 | 0.479545 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.090909 | 0.666667 | 0.557292 | 0.500 | 0.463234 |
5 | 228 | 3 | 0 | 0.286364 | 0.072727 | 0.500 | 0.148612 | 0.015000 | 0.10000 | 0.478636 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.072727 | 0.000000 | 0.286458 | 0.500 | 0.264806 |
6 | 3901893 | 3 | 0 | 0.433824 | 0.117647 | 0.750 | 0.345256 | 0.025000 | 0.07700 | 0.718382 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.117647 | 0.600000 | 0.575000 | 0.750 | 0.539779 |
7 | 1142680 | 3 | 1 | 0.412037 | 0.074074 | 0.750 | 0.343120 | 0.046667 | 0.07700 | 0.716204 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.074074 | 0.600000 | 0.615625 | 0.750 | 0.545284 |
8 | 141 | 3 | 0 | 0.286364 | 0.072727 | 0.500 | 0.081461 | 0.027500 | 0.10000 | 0.478636 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.072727 | 0.666667 | 0.406250 | 0.500 | 0.406733 |
9 | 3060834 | 3 | 0 | 0.410294 | 0.070588 | 0.750 | 0.308250 | 0.045000 | 0.06675 | 0.716029 | ... | 0.504935 | 100.0 | 0.1 | 0.674337 | 100.0 | 0.070588 | 0.400000 | 0.715625 | 0.750 | 0.549586 |
10 rows × 74 columns
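As a quick sanity check of the structure described above: each relevant (document_id, query_id)-pair contributes 10 rows (1 relevant + 9 random matches), so the row count per query_id should be 10 times the number of relevant documents for that query. A small exploratory sketch, assuming the sample keeps groups intact:

rows_per_query = train_df.groupby("query_id").size()
relevant_per_query = train_df.groupby("query_id")["label"].sum()
# Expect True if the sample preserves complete 10-document groups.
(rows_per_query == 10 * relevant_per_query).all()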
Similarly, we collected data based on the MS MARCO queries contained in the dev set.
= pd.read_csv("https://data.vespa.oath.cloud/blog/ranking/dev_sample.csv") dev_df
dev_df.shape
(74103, 72)
dev_df.head(10)
  | document_id | query_id | label | elementCompleteness(body).completeness | elementCompleteness(body).fieldCompleteness | elementCompleteness(body).queryCompleteness | fieldMatch(body) | fieldMatch(body).absoluteOccurrence | fieldMatch(body).absoluteProximity | fieldMatch(body).completeness | ... | term(3).significance | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(body).fieldCoverage | textSimilarity(body).order | textSimilarity(body).proximity | textSimilarity(body).queryCoverage | textSimilarity(body).score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8066640 | 2 | 0 | 0.380952 | 0.095238 | 0.666667 | 0.427344 | 0.01 | 0.1 | 0.638095 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.095238 | 1.0 | 1.0 | 0.666667 | 0.719048 |
1 | 4339068 | 2 | 1 | 0.346667 | 0.026667 | 0.666667 | 0.444933 | 0.04 | 0.1 | 0.634667 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.026667 | 1.0 | 1.0 | 0.666667 | 0.705333 |
2 | 762768 | 2 | 0 | 0.343750 | 0.020833 | 0.666667 | 0.088859 | 0.01 | 0.1 | 0.634375 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.020833 | 1.0 | 0.0 | 0.666667 | 0.354167 |
3 | 3370 | 2 | 0 | 0.180180 | 0.027027 | 0.333333 | 0.162049 | 0.01 | 0.1 | 0.318018 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.027027 | 0.0 | 0.0 | 0.333333 | 0.105405 |
4 | 6060 | 2 | 0 | 0.175287 | 0.017241 | 0.333333 | 0.145722 | 0.01 | 0.1 | 0.317529 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.017241 | 0.0 | 0.0 | 0.333333 | 0.103448 |
5 | 3798 | 2 | 0 | 0.180556 | 0.027778 | 0.333333 | 0.166942 | 0.01 | 0.1 | 0.318056 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.027778 | 0.0 | 0.0 | 0.333333 | 0.105556 |
6 | 2731175 | 2 | 0 | 0.345833 | 0.025000 | 0.666667 | 0.398800 | 0.01 | 0.1 | 0.634583 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.025000 | 1.0 | 1.0 | 0.666667 | 0.705000 |
7 | 3634083 | 2 | 0 | 0.351190 | 0.035714 | 0.666667 | 0.423611 | 0.02 | 0.1 | 0.635119 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.035714 | 1.0 | 1.0 | 0.666667 | 0.707143 |
8 | 112126 | 2 | 0 | 0.176282 | 0.019231 | 0.333333 | 0.177009 | 0.02 | 0.1 | 0.317628 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.019231 | 0.0 | 0.0 | 0.333333 | 0.103846 |
9 | 3387 | 2 | 0 | 0.178571 | 0.023810 | 0.333333 | 0.171357 | 0.01 | 0.1 | 0.317857 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.023810 | 0.0 | 0.0 | 0.333333 | 0.104762 |
10 rows × 72 columns
Listwise ranking framework
The ListwiseRankingFramework uses TensorFlow Ranking to minimize a listwise loss function that is a smooth approximation of the NDCG metric. The following parameters need to be specified:
from learntorank.ranking import ListwiseRankingFramework

ranking_framework = ListwiseRankingFramework(
    #
    # Task related
    #
    number_documents_per_query=10,  # The size of the list for each sample
    top_n=10,                       # Which NDCG position to optimize, e.g. NDCG@10
    #
    # Data pipeline
    #
    batch_size=32,                  # Batch size used when fitting models to the data
    shuffle_buffer_size=1000,       # Buffer size used when shuffling data batches
    #
    # Hyperparameter tuning
    #
    tuner_max_trials=3,             # How many trials to execute when searching for hyperparameters
    tuner_executions_per_trial=1,   # How many model fits per trial
    tuner_epochs=10,                # How many epochs to use per execution of the trial
    tuner_early_stop_patience=None, # Patience for early stopping
    #
    # Final model
    #
    final_epochs=30                 # Number of epochs when fitting the model with the chosen hyperparameters
)
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
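Before moving on, it helps to see what a smooth approximation of NDCG looks like. The sketch below replaces the non-differentiable rank of each document with a sum of sigmoids over pairwise score differences, in the spirit of the ApproxNDCG loss in TensorFlow Ranking; the function name and the temperature parameter are illustrative, not the exact TF Ranking implementation:

import numpy as np

def approx_ndcg(scores, labels, temperature=0.1):
    # Approximate rank: rank[i] = 1 + sum_{j != i} sigmoid((s_j - s_i) / temperature),
    # a smooth stand-in for counting how many documents outscore document i.
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    diff = (s[None, :] - s[:, None]) / temperature
    sig = 1.0 / (1.0 + np.exp(-diff))
    np.fill_diagonal(sig, 0.0)
    approx_rank = 1.0 + sig.sum(axis=1)
    dcg = np.sum((2.0 ** y - 1.0) / np.log2(1.0 + approx_rank))
    # Ideal DCG: gains sorted in decreasing order at exact ranks 1, 2, ...
    ideal_gains = np.sort(y)[::-1]
    idcg = np.sum((2.0 ** ideal_gains - 1.0) / np.log2(2.0 + np.arange(len(y))))
    return dcg / idcg if idcg > 0 else 0.0

approx_ndcg(scores=[0.9, 0.2, 0.5], labels=[1, 0, 0])  # close to 1.0: the relevant doc scores highest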
Data pipeline
It is possible to create TensorFlow data pipelines (tf.data.Dataset) either from in-memory data frames or directly from .csv files, avoiding the need to load large files into memory. The data pipelines are suited for listwise ranking and can be used as part of a custom TensorFlow workflow if desired.
Create a tf.data.Dataset from in-memory data frames:
tf_ds = ranking_framework.listwise_tf_dataset_from_df(
    df=train_df,
    feature_names=["nativeFieldMatch", "nativeProximity", "nativeRank"],
    shuffle_buffer_size=3,
    batch_size=1
)
Note that the resulting dataset is already suited for listwise learning: each element holds the features and labels of the 10 documents associated with one query.
for batch in tf_ds.take(1):
    print(batch)
(<tf.Tensor: shape=(1, 10, 3), dtype=float32, numpy=
array([[[1.9765680e-01, 6.5953881e-02, 9.5175676e-02],
[1.3242842e-01, 1.1140537e-01, 7.1235448e-02],
[3.4112938e-02, 1.2160993e-37, 1.5161305e-02],
[1.5705481e-01, 4.0344268e-02, 7.4284837e-02],
[8.6454414e-02, 3.2825880e-02, 4.2071503e-02],
[1.9139472e-01, 1.1913208e-01, 9.8301217e-02],
[4.8045117e-02, 1.2160993e-37, 2.1353386e-02],
[1.4903504e-01, 1.3032080e-01, 8.0717884e-02],
[6.3953400e-02, 2.8740479e-02, 3.1617120e-02],
[1.5656856e-01, 6.8069249e-02, 7.7149279e-02]]], dtype=float32)>, <tf.Tensor: shape=(1, 10), dtype=float32, numpy=array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]], dtype=float32)>)
For large data, we can also create a listwise tf.data.Dataset directly from a .csv file, without the need to load it into memory:
"train_sample.csv", index=False) train_df.to_csv(
tf_ds = ranking_framework.listwise_tf_dataset_from_csv(
    file_path="train_sample.csv",
    feature_names=["nativeFieldMatch", "nativeProximity", "nativeRank"],
    shuffle_buffer_size=3,
    batch_size=1
)
for batch in tf_ds.take(1):
    print(batch)
(<tf.Tensor: shape=(1, 10, 3), dtype=float32, numpy=
array([[[0.08348585, 0.04784278, 0.04242069],
[0.08451388, 0.01466913, 0.03919163],
[0.07139124, 0.02419666, 0.03441796],
[0.07348892, 0.02119719, 0.03501699],
[0.11205826, 0.10210748, 0.06114895],
[0.06779736, 0.02308168, 0.03269679],
[0.08361208, 0.00839302, 0.03809348],
[0.13477945, 0.13513905, 0.07491743],
[0.17734438, 0.18263273, 0.09911225],
[0.12978926, 0.15896696, 0.07534712]]], dtype=float32)>, <tf.Tensor: shape=(1, 10), dtype=float32, numpy=array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]], dtype=float32)>)
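Since these datasets yield (features, labels) batches shaped (batch, 10, 3) and (batch, 10), they can be dropped into a custom Keras workflow. A minimal sketch, assuming the tensorflow_ranking package is available; this is not what ListwiseRankingFramework runs internally:

import tensorflow as tf
import tensorflow_ranking as tfr

# A tiny linear scoring model: one score per document in the list.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(10, 3)),  # (batch, 10, 1)
    tf.keras.layers.Reshape((10,)),                 # (batch, 10) scores
])
model.compile(
    optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1),
    loss=tfr.keras.losses.ApproxNDCGLoss(),         # listwise smooth-NDCG surrogate
    metrics=[tfr.keras.metrics.NDCGMetric(topn=10)],
)
# model.fit(tf_ds, epochs=1)  # uncomment to train on the pipeline above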
Pre-defined models
The ranking framework comes with some pre-defined models in case you don’t want to use the data pipelines to create your own workflow. It is possible to specify either a DataFrame or a .csv file path as the train and dev input data. If the hyperparameters argument is not specified, the framework searches through the hyperparameter space according to the arguments defined when creating an instance of the ListwiseRankingFramework.
Linear model
weights, dev_eval, best_hyperparams = ranking_framework.fit_linear_model(
    train_data=train_df,
    dev_data=dev_df,
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ],
    hyperparameters=None  # Search for the best hyperparameters
)
best_hyperparams
{'learning_rate': 6.018683626059954}
weights
{'feature_names': ['fieldMatch(body).proximity',
'fieldMatch(body).queryCompleteness',
'fieldMatch(body).significance',
'nativeFieldMatch',
'nativeProximity',
'nativeRank'],
'linear_model_weights': [0.46931159496307373,
-30.97307014465332,
28.785017013549805,
18.257308959960938,
12.566983222961426,
10.918502807617188]}
dev_eval
0.7916887402534485
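The returned weights can also be applied outside the framework. A sketch, assuming the score is a plain weighted sum of the raw features (which matches the keys of the returned dict; the framework itself evaluates with listwise NDCG@10):

import numpy as np

# Score the dev documents by hand with the learned linear weights.
features = dev_df[weights["feature_names"]].to_numpy()
scores = features @ np.array(weights["linear_model_weights"])
dev_df.assign(score=scores)[["query_id", "document_id", "label", "score"]].head()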
If we instead specify the hyperparameters, the hyperparameter search will be skipped.
weights, dev_eval, best_hyperparams = ranking_framework.fit_linear_model(
    train_data=train_df,
    dev_data=dev_df,
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ],
    hyperparameters={'learning_rate': 6.018683626059954}
)
Lasso model
weights, dev_eval, best_hyperparams = ranking_framework.fit_lasso_linear_model(
    train_data=train_df,
    dev_data=dev_df,
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ]
)
print(best_hyperparams)
{'lambda': 0.0023227311360666802, 'learning_rate': 0.14885653869373894}
print(weights)
{'feature_names': ['fieldMatch(body).proximity', 'fieldMatch(body).queryCompleteness', 'fieldMatch(body).significance', 'nativeFieldMatch', 'nativeProximity', 'nativeRank'], 'normalization_mean': [0.8184928894042969, 0.530807375907898, 0.5052036643028259, 0.0906180813908577, 0.039063721895217896, 0.04461509734392166], 'normalization_sd': [0.08662283420562744, 0.05760122463107109, 0.06236378848552704, 0.003072209656238556, 0.003147233510389924, 0.0008713427814655006], 'normalization_number_data': 96990, 'linear_model_weights': [-0.022373167797923088, -2.1850321292877197, 2.055746078491211, 0.21248634159564972, 0.2774745225906372, 0.6118378043174744]}
print(dev_eval)
0.7700856328010559
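Note the extra normalization statistics in the lasso weights. A sketch of applying them, assuming the features are standardized as (x - mean) / sd before the weighted sum (an assumption about the framework's internals):

import numpy as np

# Standardize with the stored statistics, then apply the weights.
x = dev_df[weights["feature_names"]].to_numpy()
mean = np.array(weights["normalization_mean"])
sd = np.array(weights["normalization_sd"])
scores = ((x - mean) / sd) @ np.array(weights["linear_model_weights"])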
Feature selection
There are some pre-defined algorithms that can be used for feature selection. The goal is to find a subset of features that is responsible for most of the evaluation metric gains.
Lasso model search
Fit a lasso model with all feature_names. Sequentially remove the feature with the smallest absolute weight until there is only one feature in the model.
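Conceptually, the search loop looks something like the sketch below, built on the fit_lasso_linear_model call shown earlier; this is an illustration, not the library's actual implementation:

def lasso_search_sketch(train_df, dev_df, feature_names):
    results, features = [], list(feature_names)
    while features:
        w, ev, _ = ranking_framework.fit_lasso_linear_model(
            train_data=train_df, dev_data=dev_df, feature_names=features
        )
        results.append({"weights": w, "evaluation": ev})
        # Drop the feature with the smallest absolute weight and refit.
        smallest = min(range(len(features)),
                       key=lambda i: abs(w["linear_model_weights"][i]))
        features.pop(smallest)
    return results

Running the actual search: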
results = ranking_framework.lasso_model_search(
    train_data=train_df,
    dev_data=dev_df,
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ],
    output_file="lasso_model_search.json",
)
[f"Number of features {len(result['weights']['feature_names'])}; Eval metric: {result['evaluation']}"
for result in results
]
['Number of features 6; Eval metric: 0.7820510864257812',
'Number of features 5; Eval metric: 0.7812100052833557',
'Number of features 4; Eval metric: 0.7958707809448242',
'Number of features 3; Eval metric: 0.7378504872322083',
'Number of features 2; Eval metric: 0.7098456025123596',
'Number of features 1; Eval metric: 0.7048170566558838']
[result['weights']['feature_names'] for result in results]
[['fieldMatch(body).proximity',
'fieldMatch(body).queryCompleteness',
'fieldMatch(body).significance',
'nativeFieldMatch',
'nativeProximity',
'nativeRank'],
['fieldMatch(body).queryCompleteness',
'fieldMatch(body).significance',
'nativeFieldMatch',
'nativeProximity',
'nativeRank'],
['fieldMatch(body).queryCompleteness',
'fieldMatch(body).significance',
'nativeFieldMatch',
'nativeRank'],
['fieldMatch(body).queryCompleteness',
'fieldMatch(body).significance',
'nativeRank'],
['fieldMatch(body).queryCompleteness', 'nativeRank'],
['nativeRank']]
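From here we can, for instance, keep the subset with the best dev metric, which above is the four-feature model:

best = max(results, key=lambda r: r["evaluation"])
best["weights"]["feature_names"]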
Forward selection
Incrementally add one feature at a time and keep the features that maximize the validation metric.
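Conceptually this is a greedy search along the lines of the sketch below (again an illustration; that it refits with fit_linear_model internally is an assumption):

def forward_selection_sketch(train_df, dev_df, candidate_features):
    results, selected = [], []
    remaining = list(candidate_features)
    while remaining:
        trials = []
        for feature in remaining:
            w, ev, _ = ranking_framework.fit_linear_model(
                train_data=train_df, dev_data=dev_df,
                feature_names=selected + [feature],
                hyperparameters=None,
            )
            trials.append((ev, feature))
            results.append({"evaluation": ev, "weights": w,
                            "number_features": len(selected) + 1})
        # Keep the feature that maximizes the dev metric at this size.
        _, best_feature = max(trials, key=lambda t: t[0])
        selected.append(best_feature)
        remaining.remove(best_feature)
    return results

The actual search: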
= ranking_framework.forward_selection_model_search(
forward_results =train_df,
train_data=dev_df,
dev_data=[
feature_names"fieldMatch(body).proximity",
"fieldMatch(body).queryCompleteness",
"fieldMatch(body).significance",
"nativeFieldMatch",
"nativeProximity",
"nativeRank",
],="forward_model_search.json",
output_file )
Evaluation metric for the one-feature models:
["evaluation"], result["weights"]["feature_names"]) for
(result[in forward_results
result if result["number_features"] == 1
]
[(0.4771268367767334, ['fieldMatch(body).proximity']),
(0.5774978995323181, ['fieldMatch(body).queryCompleteness']),
(0.3523213565349579, ['fieldMatch(body).significance']),
(0.693596601486206, ['nativeFieldMatch']),
(0.673930287361145, ['nativeProximity']),
(0.704784631729126, ['nativeRank'])]
Evaluation metric for the two-feature models, keeping the best feature from the one-feature round:
["evaluation"], result["weights"]["feature_names"]) for
(result[in forward_results
result if result["number_features"] == 2
]
[(0.7052107453346252, ['nativeRank', 'fieldMatch(body).proximity']),
(0.7083131670951843, ['nativeRank', 'fieldMatch(body).queryCompleteness']),
(0.7050297260284424, ['nativeRank', 'fieldMatch(body).significance']),
(0.7048313617706299, ['nativeRank', 'nativeFieldMatch']),
(0.7088075876235962, ['nativeRank', 'nativeProximity'])]
And so on:
["evaluation"], result["weights"]["feature_names"]) for
(result[in forward_results
result if result["number_features"] == 3
]
[(0.7087035179138184,
['nativeRank', 'nativeProximity', 'fieldMatch(body).proximity']),
(0.7237873673439026,
['nativeRank', 'nativeProximity', 'fieldMatch(body).queryCompleteness']),
(0.7073785662651062,
['nativeRank', 'nativeProximity', 'fieldMatch(body).significance']),
(0.709153413772583, ['nativeRank', 'nativeProximity', 'nativeFieldMatch'])]