Dataset

For the passage ranking use case, we will use the MS MARCO passage dataset through the ir_datasets library. Besides being convenient, ir_datasets fixes encoding errors present in the original dataset source files.

import ir_datasets
import pandas as pd

Data Exploration

Document corpus

Start by loading the data. The dataset is downloaded once and cached on disk for future use, so the first run of the command below can take a while.

passage_corpus = ir_datasets.load("msmarco-passage")

Number of passages in the document corpus:

passage_corpus.docs_count()
8841823

Sample a few passages of the document corpus.

pd.DataFrame(passage_corpus.docs_iter()[0:5])
doc_id text
0 0 The presence of communication amid scientific ...
1 1 The Manhattan Project and its atomic bomb help...
2 2 Essay on The Manhattan Project - The Manhattan...
3 3 The Manhattan Project was the name for a proje...
4 4 versions of each volume as well as complementa...
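
If we need to look up a specific passage by its id, ir_datasets also provides random access through a docs_store; the snippet below is a minimal sketch assuming the default docs_store API, and building the lookup structure can take a while on the first call.

# Random access to a passage by its document id (sketch based on the ir_datasets docs_store API).
docs_store = passage_corpus.docs_store()
docs_store.get("0").text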

Training data

Load the training data. We use the judged version, which only includes queries with at least one relevance judgement.

passage_train = ir_datasets.load("msmarco-passage/train/judged")

Relevant documents

Number of relevant judgements:

passage_train.qrels_count()
532761

For each query id, there is a dict of relevant documents, with the document id as key and the relevance score as value.

from learntorank.passage import sample_dict_items

train_qrels_dict = passage_train.qrels_dict()
sample_dict_items(train_qrels_dict, 5)
{'1038069': {'2293922': 1},
 '700425': {'4351261': 1},
 '926242': {'3500124': 1},
 '690553': {'2877918': 1},
 '411317': {'2230220': 1}}

It is interesting to check the range of values taken by the relevance score. The code below shows that the only score present is 1, indicating that the corresponding document id is relevant to the query id.

{score
 for relevant in train_qrels_dict.values()
 for score in relevant.values()}
{1}

Queries

Number of training queries:

passage_train.queries_count()
502939

The number of queries differs from the number of relevance judgements because some queries have more than one relevant document associated with them.
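
As a quick sanity check, we can count how many training queries have more than one relevant passage, reusing the train_qrels_dict defined above:

# Number of training queries associated with more than one relevant passage.
sum(1 for relevant in train_qrels_dict.values() if len(relevant) > 1)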

Each query contains a query id and a query text.

training_queries = pd.DataFrame(passage_train.queries_iter())
training_queries.head()
query_id text
0 121352 define extreme
1 634306 what does chattel mean on credit history
2 920825 what was the great leap forward brainly
3 510633 tattoo fixers how much does it cost
4 737889 what is decentralization process.
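
Since we loaded the judged version of the training data, the set of query ids in the relevance judgements should match the set of training query ids; the check below verifies this, reusing the objects defined above:

# The judged subset should contain exactly the queries that have at least one relevance judgement.
set(train_qrels_dict) == set(training_queries["query_id"])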

Development data

As with the training data, we can load the judged development data and take a look at the queries and relevance judgements.

passage_dev = ir_datasets.load("msmarco-passage/dev/judged")

Relevant documents

Number of relevant judgements:

passage_dev.qrels_count()
59273

For each query id, there is a dict of relevant documents, with the document id as key and the relevance score as value.

dev_qrels_dict = passage_dev.qrels_dict()
sample_dict_items(dev_qrels_dict, 5)
{'255': {'7629892': 1},
 '611327': {'7610137': 1},
 '584695': {'7408281': 1},
 '300246': {'7814106': 1, '7814107': 1},
 '739094': {'7640560': 1}}

Queries

Number of dev queries:

passage_dev.queries_count()
55578

Each query contains a query id and a query text.

dev_queries = pd.DataFrame(passage_dev.queries_iter())
dev_queries.head()
query_id text
0 1048578 cost of endless pools/swim spa
1 1048579 what is pcnt
2 1048582 what is paysky
3 1048583 what is paydata
4 1048585 what is paula deen's brother

Data Manipulation

Sample data

Given the large amount of data, it is useful to work with a smaller sample when prototyping, which can be done with the sample_data function. This might take some time if the full dataset needs to be downloaded for the first time.

from learntorank.passage import sample_data

passage_sample = sample_data(n_relevant=100, n_irrelevant=800)
passage_sample
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)

Save

We can save the sampled data to disk to avoid regenerating it every time we need to use it.

passage_sample.save("sample.json")

Load

Load the data back when needed with PassageData.load:

from learntorank.passage import PassageData

loaded_sample = PassageData.load(file_path="sample.json")
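
Inspecting loaded_sample should display the same repr as passage_sample above, confirming the save/load round trip:

loaded_sample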