import ir_datasets
import pandas as pd
Dataset
For the passage ranking use case, we will use the MS MARCO passage dataset through the ir_datasets library. Besides being convenient, ir_datasets fixes encoding errors present in the original dataset source files.
Data Exploration
Document corpus
Start by loading the data. The dataset will be downloaded once and cached on disk for future use, so it takes a while the first time the command below is run.
passage_corpus = ir_datasets.load("msmarco-passage")
Number of passages in the document corpus:
passage_corpus.docs_count()
8841823
Sample a few passages of the document corpus.
pd.DataFrame(passage_corpus.docs_iter()[0:5])
| | doc_id | text |
|---|---|---|
| 0 | 0 | The presence of communication amid scientific ... |
| 1 | 1 | The Manhattan Project and its atomic bomb help... |
| 2 | 2 | Essay on The Manhattan Project - The Manhattan... |
| 3 | 3 | The Manhattan Project was the name for a proje... |
| 4 | 4 | versions of each volume as well as complementa... |
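Individual passages can also be fetched by id. The sketch below uses ir_datasets' document store lookup; treat it as an assumption that a docs store is available for this dataset (it is for most ir_datasets corpora), and note that the first lookup may build an index on disk.

# Sketch: random access to a single passage by its id.
# Assumes passage_corpus.docs_store() is available for msmarco-passage.
docs_store = passage_corpus.docs_store()
docs_store.get("3")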
Training data
Load the training data. We use the judged version, which only includes queries with at least one relevance judgement.
passage_train = ir_datasets.load("msmarco-passage/train/judged")
Relevant documents
Number of relevant judgements:
passage_train.qrels_count()
532761
For each query id, there is a dict of relevant documents containing the document id as key and the relevance score as value.
from learntorank.passage import sample_dict_items
train_qrels_dict = passage_train.qrels_dict()
sample_dict_items(train_qrels_dict, 5)
{'1038069': {'2293922': 1},
'700425': {'4351261': 1},
'926242': {'3500124': 1},
'690553': {'2877918': 1},
'411317': {'2230220': 1}}
It is interesting to check the range of values taken by the relevance score. The code below shows that the only score available is 1, indicating that the particular document id is relevant to the query id.
set(
    score
    for relevant in train_qrels_dict.values()
    for score in relevant.values()
)
{1}
Queries
Number of training queries:
passage_train.queries_count()
502939
The number of queries differs from the number of relevance judgements because some queries have more than one relevant document associated with them.
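As a quick sanity check of that claim, the sketch below reuses train_qrels_dict from the cell above and compares the number of distinct judged queries with the total number of judgements.

# Sketch: distinct judged queries vs. total number of relevance judgements.
n_judged_queries = len(train_qrels_dict)
n_judgements = sum(len(relevant) for relevant in train_qrels_dict.values())
print(n_judged_queries, n_judgements)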
Each query contains a query id and a query text.
training_queries = pd.DataFrame(passage_train.queries_iter())
training_queries.head()
| | query_id | text |
|---|---|---|
| 0 | 121352 | define extreme |
| 1 | 634306 | what does chattel mean on credit history |
| 2 | 920825 | what was the great leap forward brainly |
| 3 | 510633 | tattoo fixers how much does it cost |
| 4 | 737889 | what is decentralization process. |
Development data
Similarly to the training data, we can load the judged development data and take a look at the queries and relevance judgements.
passage_dev = ir_datasets.load("msmarco-passage/dev/judged")
Relevant documents
Number of relevant judgements:
passage_dev.qrels_count()
59273
For each query id, there is a dict of relevant documents containing the document id as key and the relevance score as value.
dev_qrels_dict = passage_dev.qrels_dict()
sample_dict_items(dev_qrels_dict, 5)
{'255': {'7629892': 1},
'611327': {'7610137': 1},
'584695': {'7408281': 1},
'300246': {'7814106': 1, '7814107': 1},
'739094': {'7640560': 1}}
Queries
Number of dev queries:
passage_dev.queries_count()
55578
Each query contains a query id and a query text.
dev_queries = pd.DataFrame(passage_dev.queries_iter())
dev_queries.head()
| | query_id | text |
|---|---|---|
| 0 | 1048578 | cost of endless pools/swim spa |
| 1 | 1048579 | what is pcnt |
| 2 | 1048582 | what is paysky |
| 3 | 1048583 | what is paydata |
| 4 | 1048585 | what is paula deen's brother |
Data Manipulation
Sample data
Given the large amount of data, it is useful to sample the data when prototyping, which can be done with the sample_data function. This might take some time if the full dataset needs to be downloaded for the first time.
from learntorank.passage import sample_data
passage_sample = sample_data(n_relevant=100, n_irrelevant=800)
passage_sample
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)
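The sampled object bundles the corpus, the queries, and the relevance judgements. The attribute names in the sketch below are read off the repr above and are assumptions about the PassageData container, so they may differ across learntorank versions.

# Sketch: inspect the sampled data. The attribute names (corpus,
# train_queries, dev_queries) are assumptions inferred from the repr above.
print(len(passage_sample.corpus))
print(len(passage_sample.train_queries))
print(len(passage_sample.dev_queries))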
Save
We can save the sampled data to disk to avoid regenerating it every time we need to use it.
"sample.json") passage_sample.save(
Load
Load the data back when needed with PassageData.load:
from learntorank.passage import PassageData
loaded_sample = PassageData.load(file_path="sample.json")
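As a final check, one might compare the loaded object with the original sample. This is a sketch that assumes PassageData supports equality comparison; if it does not, compare the individual fields instead.

# Sketch: round-trip check, assuming PassageData defines equality.
assert loaded_sample == passage_sample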