Learning to rank

Data pipelines, model fitting and feature selection

Note: this notebook is a work in progress and is not currently runnable.

Data

This section describes the data that we use to give a brief overview of the pyvespa ranking framework. The data was collected from a running Vespa application indexed with MS MARCO data. For each relevant (document_id, query_id)-pair we collected 9 randomly matched documents. Relevant documents have label=1 and non-relevant documents have label=0. In addition, each row includes many Vespa ranking features computed from the interaction between the document and the query.

import pandas as pd

train_df = pd.read_csv("https://data.vespa.oath.cloud/blog/ranking/train_sample.csv")

The data used here is a sample containing 100,000 rows and 71 features (74 columns, including document_id, query_id, and label).

train_df.shape
(100000, 74)
train_df.head(10)
document_id query_id label elementCompleteness(body).completeness elementCompleteness(body).fieldCompleteness elementCompleteness(body).queryCompleteness fieldMatch(body) fieldMatch(body).absoluteOccurrence fieldMatch(body).absoluteProximity fieldMatch(body).completeness ... term(3).significance term(3).weight term(4).connectedness term(4).significance term(4).weight textSimilarity(body).fieldCoverage textSimilarity(body).order textSimilarity(body).proximity textSimilarity(body).queryCoverage textSimilarity(body).score
0 27061 3 0 0.358796 0.092593 0.625 0.127746 0.022000 0.02600 0.598380 ... 0.504935 100.0 0.1 0.674337 100.0 0.092593 0.250000 0.437500 0.625 0.396644
1 257 3 0 0.359670 0.094340 0.625 0.092319 0.018000 0.03500 0.598467 ... 0.504935 100.0 0.1 0.674337 100.0 0.094340 0.750000 0.234375 0.625 0.400899
2 363 3 0 0.277397 0.054795 0.500 0.141511 0.030000 0.07100 0.477740 ... 0.504935 100.0 0.1 0.674337 100.0 0.054795 0.666667 0.640625 0.500 0.485178
3 22682 3 0 0.333686 0.042373 0.625 0.250817 0.056000 0.10000 0.595869 ... 0.504935 100.0 0.1 0.674337 100.0 0.042373 0.250000 0.324219 0.625 0.346951
4 160 3 0 0.295455 0.090909 0.500 0.118351 0.015000 0.05000 0.479545 ... 0.504935 100.0 0.1 0.674337 100.0 0.090909 0.666667 0.557292 0.500 0.463234
5 228 3 0 0.286364 0.072727 0.500 0.148612 0.015000 0.10000 0.478636 ... 0.504935 100.0 0.1 0.674337 100.0 0.072727 0.000000 0.286458 0.500 0.264806
6 3901893 3 0 0.433824 0.117647 0.750 0.345256 0.025000 0.07700 0.718382 ... 0.504935 100.0 0.1 0.674337 100.0 0.117647 0.600000 0.575000 0.750 0.539779
7 1142680 3 1 0.412037 0.074074 0.750 0.343120 0.046667 0.07700 0.716204 ... 0.504935 100.0 0.1 0.674337 100.0 0.074074 0.600000 0.615625 0.750 0.545284
8 141 3 0 0.286364 0.072727 0.500 0.081461 0.027500 0.10000 0.478636 ... 0.504935 100.0 0.1 0.674337 100.0 0.072727 0.666667 0.406250 0.500 0.406733
9 3060834 3 0 0.410294 0.070588 0.750 0.308250 0.045000 0.06675 0.716029 ... 0.504935 100.0 0.1 0.674337 100.0 0.070588 0.400000 0.715625 0.750 0.549586

10 rows × 74 columns
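
As a quick sanity check (an illustrative sketch, not part of the original notebook), we can verify the structure described above: every relevant document is paired with 9 sampled non-relevant ones, so roughly 1 in 10 rows should carry label=1.

# Illustrative sanity check: label distribution and rows collected per query.
# Queries with more than one relevant document contribute a multiple of 10 rows.
print(train_df["label"].value_counts(normalize=True))
print(train_df.groupby("query_id").size().describe())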

Similarly, we collected data based on the MS MARCO queries contained in the dev set.

dev_df = pd.read_csv("https://data.vespa.oath.cloud/blog/ranking/dev_sample.csv")
dev_df.shape
(74103, 72)
dev_df.head(10)
document_id query_id label elementCompleteness(body).completeness elementCompleteness(body).fieldCompleteness elementCompleteness(body).queryCompleteness fieldMatch(body) fieldMatch(body).absoluteOccurrence fieldMatch(body).absoluteProximity fieldMatch(body).completeness ... term(3).significance term(3).weight term(4).connectedness term(4).significance term(4).weight textSimilarity(body).fieldCoverage textSimilarity(body).order textSimilarity(body).proximity textSimilarity(body).queryCoverage textSimilarity(body).score
0 8066640 2 0 0.380952 0.095238 0.666667 0.427344 0.01 0.1 0.638095 ... 0.0 0.0 0.0 0.0 0.0 0.095238 1.0 1.0 0.666667 0.719048
1 4339068 2 1 0.346667 0.026667 0.666667 0.444933 0.04 0.1 0.634667 ... 0.0 0.0 0.0 0.0 0.0 0.026667 1.0 1.0 0.666667 0.705333
2 762768 2 0 0.343750 0.020833 0.666667 0.088859 0.01 0.1 0.634375 ... 0.0 0.0 0.0 0.0 0.0 0.020833 1.0 0.0 0.666667 0.354167
3 3370 2 0 0.180180 0.027027 0.333333 0.162049 0.01 0.1 0.318018 ... 0.0 0.0 0.0 0.0 0.0 0.027027 0.0 0.0 0.333333 0.105405
4 6060 2 0 0.175287 0.017241 0.333333 0.145722 0.01 0.1 0.317529 ... 0.0 0.0 0.0 0.0 0.0 0.017241 0.0 0.0 0.333333 0.103448
5 3798 2 0 0.180556 0.027778 0.333333 0.166942 0.01 0.1 0.318056 ... 0.0 0.0 0.0 0.0 0.0 0.027778 0.0 0.0 0.333333 0.105556
6 2731175 2 0 0.345833 0.025000 0.666667 0.398800 0.01 0.1 0.634583 ... 0.0 0.0 0.0 0.0 0.0 0.025000 1.0 1.0 0.666667 0.705000
7 3634083 2 0 0.351190 0.035714 0.666667 0.423611 0.02 0.1 0.635119 ... 0.0 0.0 0.0 0.0 0.0 0.035714 1.0 1.0 0.666667 0.707143
8 112126 2 0 0.176282 0.019231 0.333333 0.177009 0.02 0.1 0.317628 ... 0.0 0.0 0.0 0.0 0.0 0.019231 0.0 0.0 0.333333 0.103846
9 3387 2 0 0.178571 0.023810 0.333333 0.171357 0.01 0.1 0.317857 ... 0.0 0.0 0.0 0.0 0.0 0.023810 0.0 0.0 0.333333 0.104762

10 rows × 72 columns

Listwise ranking framework

The ListwiseRankingFramework uses TensorFlow Ranking to minimize a listwise loss function that is a smooth approximation of the NDCG metric. The following parameters need to be specified:

from learntorank.ranking import ListwiseRankingFramework

ranking_framework = ListwiseRankingFramework(
    #
    # Task related 
    #
    number_documents_per_query=10,  # The size of the list for each sample
    top_n=10,                       # What NDCG position we want to optimize, e.g. NDCG@10
    #
    # Data pipeline 
    #
    batch_size=32,                  # Batch size used when fitting models to the data
    shuffle_buffer_size=1000,       # The buffer size used when shuffling data batches.
    #
    # Hyperparameter tuning 
    #
    tuner_max_trials=3,             # How many trials to execute when searching for hyperparameters
    tuner_executions_per_trial=1,   # How many model fits per trial
    tuner_epochs=10,                # How many epochs to use per execution of the trial
    tuner_early_stop_patience=None, # Set patience number for early stopping
    #
    # Final model
    #
    final_epochs=30                 # Number of epochs to use when fitting the model with specific hyperparameters.
)
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)

Data pipeline

It is possible to create TensorFlow data pipelines (tf.data.Dataset) either from in-memory data frames or directly from .csv files, avoiding the need to load large files into memory. The data pipelines are suited for listwise ranking and can be used as part of a custom TensorFlow workflow if desired (a sketch of such a workflow follows the example below).

Create a tf.data.Dataset from in-memory data frames:

tf_ds = ranking_framework.listwise_tf_dataset_from_df(
    df=train_df, 
    feature_names=["nativeFieldMatch", "nativeProximity", "nativeRank"],
    shuffle_buffer_size=3,
    batch_size=1
)

Note that the resulting dataset is already suited for listwise learning: each element holds a list of documents for a single query together with their labels.

for batch in tf_ds.take(1):
    print(batch)
(<tf.Tensor: shape=(1, 10, 3), dtype=float32, numpy=
array([[[1.9765680e-01, 6.5953881e-02, 9.5175676e-02],
        [1.3242842e-01, 1.1140537e-01, 7.1235448e-02],
        [3.4112938e-02, 1.2160993e-37, 1.5161305e-02],
        [1.5705481e-01, 4.0344268e-02, 7.4284837e-02],
        [8.6454414e-02, 3.2825880e-02, 4.2071503e-02],
        [1.9139472e-01, 1.1913208e-01, 9.8301217e-02],
        [4.8045117e-02, 1.2160993e-37, 2.1353386e-02],
        [1.4903504e-01, 1.3032080e-01, 8.0717884e-02],
        [6.3953400e-02, 2.8740479e-02, 3.1617120e-02],
        [1.5656856e-01, 6.8069249e-02, 7.7149279e-02]]], dtype=float32)>, <tf.Tensor: shape=(1, 10), dtype=float32, numpy=array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]], dtype=float32)>)
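
Since the dataset yields batches of features with shape (batch, list_size, n_features) and labels with shape (batch, list_size), it can be plugged into a custom workflow. The sketch below is an assumption of what such a workflow could look like, scoring each document with a shared linear layer and training against TensorFlow Ranking's ApproxNDCGLoss; it is not how the ListwiseRankingFramework is implemented internally.

import tensorflow as tf
import tensorflow_ranking as tfr

# Illustrative custom workflow (an assumption, not the framework internals):
# score each document in the list with a shared linear layer and train with
# a smooth approximation of NDCG from TensorFlow Ranking.
custom_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 3)),  # (list_size, n_features)
    tf.keras.layers.Dense(1),                # one score per document
    tf.keras.layers.Flatten(),               # -> (batch, list_size)
])
custom_model.compile(
    optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1),
    loss=tfr.keras.losses.ApproxNDCGLoss(),
    metrics=[tfr.keras.metrics.NDCGMetric(topn=10)],
)
custom_model.fit(tf_ds, epochs=1)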

For large data, we can also create a listwise tf.data.Dataset directly from a .csv file, without the need to load it into memory:

train_df.to_csv("train_sample.csv", index=False)
tf_ds = ranking_framework.listwise_tf_dataset_from_csv(
    file_path="train_sample.csv",
    feature_names=["nativeFieldMatch", "nativeProximity", "nativeRank"],
    shuffle_buffer_size=3,
    batch_size=1
)
for batch in tf_ds.take(1):
    print(batch)
(<tf.Tensor: shape=(1, 10, 3), dtype=float32, numpy=
array([[[0.08348585, 0.04784278, 0.04242069],
        [0.08451388, 0.01466913, 0.03919163],
        [0.07139124, 0.02419666, 0.03441796],
        [0.07348892, 0.02119719, 0.03501699],
        [0.11205826, 0.10210748, 0.06114895],
        [0.06779736, 0.02308168, 0.03269679],
        [0.08361208, 0.00839302, 0.03809348],
        [0.13477945, 0.13513905, 0.07491743],
        [0.17734438, 0.18263273, 0.09911225],
        [0.12978926, 0.15896696, 0.07534712]]], dtype=float32)>, <tf.Tensor: shape=(1, 10), dtype=float32, numpy=array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]], dtype=float32)>)

Pre-defined models

The ranking framework comes with some pre-defined models in case you don’t want to use the data pipelines to create your own workflow. It is possible to specify either a DataFrame or a .csv file path as the train and dev input data. If the hyperparameters argument is not specified, the framework will search through the hyperparameter space according to the arguments defined when creating an instance of the ListwiseRankingFramework.

Linear model

weights, dev_eval, best_hyperparams = ranking_framework.fit_linear_model(
    train_data=train_df, 
    dev_data=dev_df, 
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ],
    hyperparameters=None # Search for best hyperparameters
)
best_hyperparams
{'learning_rate': 6.018683626059954}
weights
{'feature_names': ['fieldMatch(body).proximity',
  'fieldMatch(body).queryCompleteness',
  'fieldMatch(body).significance',
  'nativeFieldMatch',
  'nativeProximity',
  'nativeRank'],
 'linear_model_weights': [0.46931159496307373,
  -30.97307014465332,
  28.785017013549805,
  18.257308959960938,
  12.566983222961426,
  10.918502807617188]}
dev_eval
0.7916887402534485

If we instead specify the hyperparameters, the hyperparameter search will be skipped.

weights, dev_eval, best_hyperparams = ranking_framework.fit_linear_model(
    train_data=train_df, 
    dev_data=dev_df, 
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ],
    hyperparameters={'learning_rate': 6.018683626059954} 
)
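
The returned weights can also be used outside the framework, for example to assemble a linear first-phase ranking expression for a Vespa rank profile. The snippet below is just a sketch built from the weights dict shown above; the expression string construction is not a learntorank API.

# Illustrative sketch: build a linear ranking expression of the form
# "w1 * feature1 + w2 * feature2 + ..." from the returned weights.
linear_expression = " + ".join(
    "{} * {}".format(weight, feature)
    for weight, feature in zip(weights["linear_model_weights"], weights["feature_names"])
)
print(linear_expression)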

Lasso model

weights, dev_eval, best_hyperparams = ranking_framework.fit_lasso_linear_model(
    train_data=train_df, 
    dev_data=dev_df, 
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ]
)
print(best_hyperparams)
{'lambda': 0.0023227311360666802, 'learning_rate': 0.14885653869373894}
print(weights)
{'feature_names': ['fieldMatch(body).proximity', 'fieldMatch(body).queryCompleteness', 'fieldMatch(body).significance', 'nativeFieldMatch', 'nativeProximity', 'nativeRank'], 'normalization_mean': [0.8184928894042969, 0.530807375907898, 0.5052036643028259, 0.0906180813908577, 0.039063721895217896, 0.04461509734392166], 'normalization_sd': [0.08662283420562744, 0.05760122463107109, 0.06236378848552704, 0.003072209656238556, 0.003147233510389924, 0.0008713427814655006], 'normalization_number_data': 96990, 'linear_model_weights': [-0.022373167797923088, -2.1850321292877197, 2.055746078491211, 0.21248634159564972, 0.2774745225906372, 0.6118378043174744]}
print(dev_eval)
0.7700856328010559
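
Note that the lasso weights also contain normalization statistics. The sketch below shows one plausible way to apply them outside the framework: standardize each feature with the stored mean and standard deviation and take the dot product with the weights. The exact scoring function used by the framework (e.g. whether a bias term is involved) is an assumption here.

import numpy as np

# Illustrative scoring sketch (an assumption, not a library call): standardize the
# features with the stored statistics, then take the dot product with the weights.
features = dev_df[weights["feature_names"]].to_numpy()
standardized = (features - np.array(weights["normalization_mean"])) / np.array(weights["normalization_sd"])
scores = standardized @ np.array(weights["linear_model_weights"])
print(scores[:5])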

Feature selection

There are some pre-defined algorithms that can be used for feature selection. The goal is to find a subset of features that is responsible for most of the evaluation metric gains.

Forward selection

Forward selection incrementally adds one feature at a time, at each step keeping the feature that maximizes the validation metric.

forward_results = ranking_framework.forward_selection_model_search(
    train_data=train_df, 
    dev_data=dev_df, 
    feature_names=[
        "fieldMatch(body).proximity",
        "fieldMatch(body).queryCompleteness",
        "fieldMatch(body).significance",
        "nativeFieldMatch",
        "nativeProximity",
        "nativeRank",
    ],
    output_file="forward_model_search.json",
)

Evaluation metric for the one-feature models:

[
    (result["evaluation"], result["weights"]["feature_names"]) for 
     result in forward_results 
     if result["number_features"] == 1
]
[(0.4771268367767334, ['fieldMatch(body).proximity']),
 (0.5774978995323181, ['fieldMatch(body).queryCompleteness']),
 (0.3523213565349579, ['fieldMatch(body).significance']),
 (0.693596601486206, ['nativeFieldMatch']),
 (0.673930287361145, ['nativeProximity']),
 (0.704784631729126, ['nativeRank'])]

Evaluation metric for the two-feature models, keeping the best feature from the one-feature models:

[
    (result["evaluation"], result["weights"]["feature_names"]) for 
     result in forward_results 
     if result["number_features"] == 2
]
[(0.7052107453346252, ['nativeRank', 'fieldMatch(body).proximity']),
 (0.7083131670951843, ['nativeRank', 'fieldMatch(body).queryCompleteness']),
 (0.7050297260284424, ['nativeRank', 'fieldMatch(body).significance']),
 (0.7048313617706299, ['nativeRank', 'nativeFieldMatch']),
 (0.7088075876235962, ['nativeRank', 'nativeProximity'])]

And so on:

[
    (result["evaluation"], result["weights"]["feature_names"]) for 
     result in forward_results 
     if result["number_features"] == 3
]
[(0.7087035179138184,
  ['nativeRank', 'nativeProximity', 'fieldMatch(body).proximity']),
 (0.7237873673439026,
  ['nativeRank', 'nativeProximity', 'fieldMatch(body).queryCompleteness']),
 (0.7073785662651062,
  ['nativeRank', 'nativeProximity', 'fieldMatch(body).significance']),
 (0.709153413772583, ['nativeRank', 'nativeProximity', 'nativeFieldMatch'])]
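
Since forward_results is a plain list of dicts, the best-performing feature subset across all steps can be picked directly:

# Pick the feature subset with the highest dev-set evaluation across all steps.
best = max(forward_results, key=lambda result: result["evaluation"])
print(best["evaluation"], best["weights"]["feature_names"])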