import ir_datasets
import pandas as pd
Dataset
For the passage ranking use case, we will use the MS MARCO passage dataset through the ir_datasets library. Besides being convenient, ir_datasets fixes encoding errors present in the original dataset source files.
Data Exploration
Document corpus
Start by loading the data. The dataset will be downloaded once and cached on disk for future use, so it takes a while the first time the command below is run.
passage_corpus = ir_datasets.load("msmarco-passage")
Number of passages in the document corpus:
passage_corpus.docs_count()
8841823
Sample a few passages of the document corpus.
pd.DataFrame(passage_corpus.docs_iter()[0:5])
| | doc_id | text |
|---|---|---|
| 0 | 0 | The presence of communication amid scientific ... |
| 1 | 1 | The Manhattan Project and its atomic bomb help... |
| 2 | 2 | Essay on The Manhattan Project - The Manhattan... |
| 3 | 3 | The Manhattan Project was the name for a proje... |
| 4 | 4 | versions of each volume as well as complementa... |
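Individual passages can also be fetched by id. The sketch below uses ir_datasets' document store lookup; treat it as an assumption that a docs store is available for this dataset (it is for most ir_datasets corpora), and note that the first lookup may build an index on disk.

# Sketch: random access to a single passage by its id.
# Assumes passage_corpus.docs_store() is available for msmarco-passage.
docs_store = passage_corpus.docs_store()
docs_store.get("3")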
Training data
Load the training data. We use the judged version, which only includes queries with at least one relevance judgement.
passage_train = ir_datasets.load("msmarco-passage/train/judged")
Relevant documents
Number of relevant judgements:
passage_train.qrels_count()
532761
For each query id, there is a dict of relevant documents containing the document id as key and the relevance score as value.
from learntorank.passage import sample_dict_items
train_qrels_dict = passage_train.qrels_dict()
sample_dict_items(train_qrels_dict, 5)
{'1038069': {'2293922': 1},
'700425': {'4351261': 1},
'926242': {'3500124': 1},
'690553': {'2877918': 1},
'411317': {'2230220': 1}}
It is interesting to check the range of values taken by the relevance score. The code below shows that the only score available is 1, indicating that the particular document id is relevant to the query id.
set(
    score
    for relevant in train_qrels_dict.values()
    for score in relevant.values()
)
{1}
Queries
Number of training queries:
passage_train.queries_count()
502939
The number of queries differs from the number of relevance judgements because some queries have more than one relevant document associated with them.
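As a quick sanity check of that claim, the sketch below reuses train_qrels_dict from the cell above and compares the number of distinct judged queries with the total number of judgements.

# Sketch: distinct judged queries vs. total number of relevance judgements.
n_judged_queries = len(train_qrels_dict)
n_judgements = sum(len(relevant) for relevant in train_qrels_dict.values())
print(n_judged_queries, n_judgements)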
Each query contains a query id and a query text.
training_queries = pd.DataFrame(passage_train.queries_iter())
training_queries.head()
| | query_id | text |
|---|---|---|
| 0 | 121352 | define extreme |
| 1 | 634306 | what does chattel mean on credit history |
| 2 | 920825 | what was the great leap forward brainly |
| 3 | 510633 | tattoo fixers how much does it cost |
| 4 | 737889 | what is decentralization process. |
Development data
Similarly to the training data, we can load the judged development data and take a look at the queries and relevance judgements.
passage_dev = ir_datasets.load("msmarco-passage/dev/judged")
Relevant documents
Number of relevant judgements:
passage_dev.qrels_count()
59273
For each query id, there is a dict of relevant documents containing the document id as key and the relevance score as value.
dev_qrels_dict = passage_dev.qrels_dict()
sample_dict_items(dev_qrels_dict, 5)
{'255': {'7629892': 1},
'611327': {'7610137': 1},
'584695': {'7408281': 1},
'300246': {'7814106': 1, '7814107': 1},
'739094': {'7640560': 1}}
Queries
Number of dev queries:
passage_dev.queries_count()
55578
Each query contains a query id and a query text.
dev_queries = pd.DataFrame(passage_dev.queries_iter())
dev_queries.head()
| | query_id | text |
|---|---|---|
| 0 | 1048578 | cost of endless pools/swim spa |
| 1 | 1048579 | what is pcnt |
| 2 | 1048582 | what is paysky |
| 3 | 1048583 | what is paydata |
| 4 | 1048585 | what is paula deen's brother |
Data Manipulation
Sample data
Given the large amount of data, it is useful to sample the data when prototyping, which can be done with the sample_data function. This might take some time if the full dataset needs to be downloaded for the first time.
from learntorank.passage import sample_data
passage_sample = sample_data(n_relevant=100, n_irrelevant=800)
passage_sample
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)
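The sampled object bundles the corpus, the queries, and the relevance judgements. The attribute names in the sketch below are read off the repr above and are assumptions about the PassageData container, so they may differ across learntorank versions.

# Sketch: inspect the sampled data. The attribute names (corpus,
# train_queries, dev_queries) are assumptions inferred from the repr above.
print(len(passage_sample.corpus))
print(len(passage_sample.train_queries))
print(len(passage_sample.dev_queries))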
Save
We can save the sampled data to disk to avoid regenerating it every time we need to use it.
"sample.json") passage_sample.save(
Load
Load the data back when needed with PassageData.load:
from learntorank.passage import PassageData
loaded_sample = PassageData.load(file_path="sample.json")
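As a final check, one might compare the loaded object with the original sample. This is a sketch that assumes PassageData supports equality comparison; if it does not, compare the individual fields instead.

# Sketch: round-trip check, assuming PassageData defines equality.
assert loaded_sample == passage_sample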