passage

API reference for the passage ranking use case.

Data manipulation

Code related to the manipulation of passage ranking data.



sample_dict_items

 sample_dict_items (d:Dict, n:int)

Sample items from a dict.

         Type         Details
d        typing.Dict  Dict to be sampled from.
n        int          Number of samples.
Returns  typing.Dict  Dict with sampled values.

Usage:

d = {"a": 1, "b":2, "c":3}
sample_dict_items(d, 1)
{'a': 1}
sample_dict_items(d, 2)
{'c': 3, 'b': 2}
sample_dict_items(d, 3)
{'a': 1, 'c': 3, 'b': 2}

When the number of samples exceeds the length of the dict, the full dict is returned:

sample_dict_items(d, 4)
{'a': 1, 'c': 3, 'b': 2}
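
The capping behavior shown above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the name `sample_dict_items_sketch` is hypothetical, and `random.sample` is assumed as the sampling primitive.

```python
import random
from typing import Dict


def sample_dict_items_sketch(d: Dict, n: int) -> Dict:
    """Return a dict holding at most n randomly chosen items of d."""
    # Cap n at the dict size so an oversized request returns every item
    # instead of raising ValueError inside random.sample.
    n = min(n, len(d))
    keys = random.sample(list(d), n)
    return {k: d[k] for k in keys}


d = {"a": 1, "b": 2, "c": 3}
two_items = sample_dict_items_sketch(d, 2)   # two random items of d
all_items = sample_dict_items_sketch(d, 4)   # n > len(d): every item comes back
```

Because the keys are drawn at random, the order of the returned items is not fixed, which matches the varying key order in the outputs above.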


save_data

 save_data (corpus:Dict, train_qrels:Dict, train_queries:Dict,
            dev_qrels:Dict, dev_queries:Dict,
            file_path:str='passage_sample.json')

Save data to disk.

This is mainly intended for saving sample data.

               Type         Default              Details
corpus         typing.Dict                       Document corpus, see usage example below.
train_qrels    typing.Dict                       Training relevance scores, see usage example below.
train_queries  typing.Dict                       Training queries, see usage example below.
dev_qrels      typing.Dict                       Development relevance scores, see usage example below.
dev_queries    typing.Dict                       Development queries, see usage example below.
file_path      str          passage_sample.json  Valid JSON file path.
Returns        None                              Side effect: data is saved to file_path.

Usage:

corpus = {
    "0": "sentence 0", 
    "1": "sentence 1", 
    "2": "sentence 2", 
    "3": "sentence 3"
}
train_queries = {
    "10": "train query 10",
    "11": "train query 11"
}
train_qrels = {
    "10": {"0": 1},
    "11": {"2": 1}
}
dev_queries = {
    "20": "train query 20",
    "21": "train query 21"
}
dev_qrels = {
    "20": {"1": 1},
    "21": {"3": 1}
}
save_data(
    corpus, 
    train_qrels, train_queries, 
    dev_qrels, dev_queries, 
    file_path="passage_sample.json"
)
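
To make the side effect concrete, here is a minimal sketch of what saving the five dicts could look like. It assumes the file holds a single JSON object keyed by `corpus`, `train_qrels`, `train_queries`, `dev_qrels` and `dev_queries` (the keys that `load_data` returns); the actual on-disk layout used by `save_data` may differ, and `save_data_sketch` is a hypothetical name.

```python
import json
import os
import tempfile
from typing import Dict


def save_data_sketch(corpus: Dict, train_qrels: Dict, train_queries: Dict,
                     dev_qrels: Dict, dev_queries: Dict, file_path: str) -> None:
    """Write the five dicts to file_path as one JSON object (assumed layout)."""
    data = {
        "corpus": corpus,
        "train_qrels": train_qrels,
        "train_queries": train_queries,
        "dev_qrels": dev_qrels,
        "dev_queries": dev_queries,
    }
    with open(file_path, "w") as f:
        json.dump(data, f)


# Round trip through a temporary file to check that nothing is lost.
path = os.path.join(tempfile.mkdtemp(), "passage_sample.json")
save_data_sketch(
    {"0": "sentence 0"},          # corpus
    {"10": {"0": 1}},             # train_qrels
    {"10": "train query 10"},     # train_queries
    {},                           # dev_qrels
    {},                           # dev_queries
    path,
)
with open(path) as f:
    restored = json.load(f)
```

String keys are used throughout because JSON object keys are always strings; integer ids would come back as strings after a round trip.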


load_data

 load_data (file_path:Optional[str]=None)

Load data.

The main goal is to load sample data from disk. If file_path is not provided, a pre-generated data sample is downloaded instead.

           Type                  Default  Details
file_path  typing.Optional[str]  None     Valid JSON file path containing data saved by save_data. If None, a pre-generated sample will be downloaded.
Returns    typing.Dict                    See usage example below for expected format.

Usage:

  • With file_path:
data = load_data("passage_sample.json")
data
{'corpus': {'0': 'sentence 0',
  '1': 'sentence 1',
  '2': 'sentence 2',
  '3': 'sentence 3'},
 'train_qrels': {'10': {'0': 1}, '11': {'2': 1}},
 'train_queries': {'10': 'train query 10', '11': 'train query 11'},
 'dev_qrels': {'20': {'1': 1}, '21': {'3': 1}},
 'dev_queries': {'20': 'train query 20', '21': 'train query 21'}}
  • Without file_path, a pre-generated data sample is downloaded:
data = load_data()
data.keys()
dict_keys(['corpus', 'train_qrels', 'train_queries', 'dev_qrels', 'dev_queries'])
len(data["corpus"])
1000
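
After loading, it can be useful to check that every judged document is actually present in the corpus. The helper below is a hypothetical sketch, not part of the library; it only assumes the dict format shown in the usage example above.

```python
from typing import Dict, List


def missing_relevant_docs(data: Dict) -> List[str]:
    """Return ids of documents referenced by qrels but absent from the corpus."""
    missing = []
    for split in ("train_qrels", "dev_qrels"):
        for judgments in data[split].values():
            for doc_id in judgments:
                if doc_id not in data["corpus"]:
                    missing.append(doc_id)
    return missing


# Same shape as the dict returned by load_data in the example above.
data = {
    "corpus": {"0": "sentence 0", "1": "sentence 1",
               "2": "sentence 2", "3": "sentence 3"},
    "train_qrels": {"10": {"0": 1}, "11": {"2": 1}},
    "train_queries": {"10": "train query 10", "11": "train query 11"},
    "dev_qrels": {"20": {"1": 1}, "21": {"3": 1}},
    "dev_queries": {"20": "train query 20", "21": "train query 21"},
}
missing = missing_relevant_docs(data)  # empty when the sample is consistent
```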


PassageData

 PassageData (corpus:Optional[Dict]=None, train_qrels:Optional[Dict]=None,
              train_queries:Optional[Dict]=None,
              dev_qrels:Optional[Dict]=None,
              dev_queries:Optional[Dict]=None)

Container for passage data.

               Type                          Default  Details
corpus         typing.Optional[typing.Dict]  None     Document corpus, see usage example below.
train_qrels    typing.Optional[typing.Dict]  None     Training relevance scores, see usage example below.
train_queries  typing.Optional[typing.Dict]  None     Training queries, see usage example below.
dev_qrels      typing.Optional[typing.Dict]  None     Development relevance scores, see usage example below.
dev_queries    typing.Optional[typing.Dict]  None     Development queries, see usage example below.

Usage:

corpus = {
    "0": "sentence 0", 
    "1": "sentence 1", 
    "2": "sentence 2", 
    "3": "sentence 3"
}
train_queries = {
    "10": "train query 10",
    "11": "train query 11"
}
train_qrels = {
    "10": {"0": 1},
    "11": {"2": 1}
}
dev_queries = {
    "20": "train query 20",
    "21": "train query 21"
}
dev_qrels = {
    "20": {"1": 1},
    "21": {"3": 1}
}
passage_data = PassageData(
    corpus=corpus, 
    train_queries = train_queries, 
    train_qrels=train_qrels,
    dev_queries = dev_queries,
    dev_qrels = dev_qrels
)
passage_data
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)


PassageData.save

 PassageData.save (file_path:str='passage_sample.json')
passage_data.save()


PassageData.load

 PassageData.load (file_path:Optional[str]=None)

Load passage data from disk.

           Type                  Default  Details
file_path  typing.Optional[str]  None     Valid JSON file path containing data saved by save_data. If None, a pre-generated sample will be downloaded.
Returns    PassageData
data = PassageData.load(file_path="passage_sample.json")
data
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)


PassageData.summary

 PassageData.summary ()

Summary of the size of the dataset components.

data.summary
Number of documents: 4
Number of train queries: 2
Number of train relevance judgments: 2
Number of dev queries: 2
Number of dev relevance judgments: 2
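
The counts above can be reproduced from the raw dicts. The sketch below is illustrative only: `summarize` is a hypothetical helper, and it assumes that a "relevance judgment" is one (query, document) pair in the qrels, which the library may count differently.

```python
from typing import Dict


def count_judgments(qrels: Dict[str, Dict[str, int]]) -> int:
    """Count (query, document) pairs across all queries."""
    return sum(len(docs) for docs in qrels.values())


def summarize(corpus: Dict, train_qrels: Dict, train_queries: Dict,
              dev_qrels: Dict, dev_queries: Dict) -> Dict[str, int]:
    """Collect the size of each dataset component."""
    return {
        "documents": len(corpus),
        "train queries": len(train_queries),
        "train relevance judgments": count_judgments(train_qrels),
        "dev queries": len(dev_queries),
        "dev relevance judgments": count_judgments(dev_qrels),
    }


sizes = summarize(
    corpus={"0": "sentence 0", "1": "sentence 1",
            "2": "sentence 2", "3": "sentence 3"},
    train_qrels={"10": {"0": 1}, "11": {"2": 1}},
    train_queries={"10": "train query 10", "11": "train query 11"},
    dev_qrels={"20": {"1": 1}, "21": {"3": 1}},
    dev_queries={"20": "train query 20", "21": "train query 21"},
)
```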


PassageData.get_corpus

 PassageData.get_corpus ()
passage_data.get_corpus()
doc_id text
0 0 sentence 0
1 1 sentence 1
2 2 sentence 2
3 3 sentence 3


PassageData.get_queries

 PassageData.get_queries (type:str)

Get query data.

         Type       Details
type     str        Either ‘train’ or ‘dev’.
Returns  DataFrame  DataFrame containing ‘query_id’ and ‘query’.
passage_data.get_queries(type="train")
query_id query
0 10 train query 10
1 11 train query 11
passage_data.get_queries(type="dev")
query_id query
0 20 train query 20
1 21 train query 21


PassageData.get_labels

 PassageData.get_labels (type:str)

Get labeled data.

         Type         Details
type     str          Either ‘train’ or ‘dev’.
Returns  typing.Dict  pyvespa-formatted labeled data
passage_data.get_labels(type="train")
[{'query_id': '10',
  'query': 'train query 10',
  'relevant_docs': [{'id': '0', 'score': 1}]},
 {'query_id': '11',
  'query': 'train query 11',
  'relevant_docs': [{'id': '2', 'score': 1}]}]
passage_data.get_labels(type="dev")
[{'query_id': '20',
  'query': 'train query 20',
  'relevant_docs': [{'id': '1', 'score': 1}]},
 {'query_id': '21',
  'query': 'train query 21',
  'relevant_docs': [{'id': '3', 'score': 1}]}]
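
The labeled-data format above is a join of the queries and qrels dicts. The following sketch shows one way to build it by hand; `build_labels` is a hypothetical name, and the code only assumes the input and output shapes shown in this section, not how the library does it internally.

```python
from typing import Dict, List


def build_labels(queries: Dict[str, str],
                 qrels: Dict[str, Dict[str, int]]) -> List[Dict]:
    """Combine query text and relevance judgments into one record per query."""
    labels = []
    for query_id, judgments in qrels.items():
        labels.append({
            "query_id": query_id,
            "query": queries[query_id],
            "relevant_docs": [
                {"id": doc_id, "score": score}
                for doc_id, score in judgments.items()
            ],
        })
    return labels


train_queries = {"10": "train query 10", "11": "train query 11"}
train_qrels = {"10": {"0": 1}, "11": {"2": 1}}
train_labels = build_labels(train_queries, train_qrels)
```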


sample_data

 sample_data (n_relevant:int, n_irrelevant:int)

Sample data from the passage ranking dataset.

The final sample contains n_relevant train relevant documents, n_relevant dev relevant documents and n_irrelevant random documents sampled from the entire corpus.

All the sampled relevant documents, from both the train and dev sets, are guaranteed to be in the corpus_sample, which will contain 2 * n_relevant + n_irrelevant documents.

              Type         Details
n_relevant    int          The number of relevant documents to sample.
n_irrelevant  int          The number of non-judged documents to sample.
Returns       PassageData

Usage:

sample = sample_data(n_relevant=1, n_irrelevant=3)

The sampled corpus is a dict containing document id as key and the passage text as value.

sample.corpus
{'890370': 'the map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albaniahe map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albania',
 '5060205': 'Setting custom HTTP headers with cURL can be done by using the CURLOPT_HTTPHEADER option, which can be set with the curl_setopt function. To add headers to your HTTP request you need to put them into a PHP Array, which you can then pass to the cul_setopt function, like demonstrated in the below example.',
 '6096573': "The sugar in RNA is ribose, whereas the sugar in DNA is deoxyribose. The only difference between the two is that in deoxyribose, there is an oxygen missing from the 2' carbon …(there is a H there instead of an OH). This makes DNA more stable/less reactive than RNA. 1 person found this useful.",
 '3092885': 'All three C-Ph bonds are typical of sp 3 - sp 2 carbon-carbon bonds with lengths of approximately 1.47 A, å while The-C o bond length is approximately.1 42. A å the presence of three adjacent phenyl groups confers special properties manifested in the reactivity of. the alcohol',
 '7275560': 'shortest phase of mitosis Anaphase is the shortest phase of mitosis. During anaphase the arranged chromosomes at the metaphase plate are migrate towards their respective poles. Before this migration started, chromosomes are divided into sister chromatids, by the separation of joined centromere of two sister chromatids of a chromosomes.'}

The size of the sampled document corpus is equal to 2 * n_relevant + n_irrelevant.

len(sample.corpus)
5

Sampled queries are dicts containing query id as key and query text as value.

print(sample.train_queries)
print(sample.dev_queries)
{'899723': 'what sugar is found in rna'}
{'994205': 'which is the shortest stage in duration'}

Sampled qrels contain one relevant document for each query.

print(sample.train_qrels)
print(sample.dev_qrels)
{'899723': {'6096573': 1}}
{'994205': {'7275560': 1}}

The following relevant documents are guaranteed to be included in the corpus_sample.

['6096573', '7275560']
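
The sampling scheme can be illustrated on toy data. This is a sketch under stated assumptions, not the library's code: `sample_split` and `sample_corpus` are hypothetical helpers, each query is assumed to have exactly one relevant document, and the extra documents are drawn only from documents not already selected so that the final size is exactly 2 * n_relevant + n_irrelevant.

```python
import random
from typing import Dict, Tuple


def sample_split(qrels: Dict, queries: Dict, n_relevant: int,
                 rng: random.Random) -> Tuple[Dict, Dict]:
    """Pick n_relevant judged queries; keep their qrels and query text."""
    query_ids = rng.sample(sorted(qrels), min(n_relevant, len(qrels)))
    return ({q: qrels[q] for q in query_ids},
            {q: queries[q] for q in query_ids})


def sample_corpus(corpus: Dict, train_qrels: Dict, dev_qrels: Dict,
                  n_irrelevant: int, rng: random.Random) -> Dict:
    """Keep every sampled relevant document plus n_irrelevant other documents."""
    relevant_ids = {doc_id
                    for qrels in (train_qrels, dev_qrels)
                    for docs in qrels.values()
                    for doc_id in docs}
    # Assumption: draw extras from documents not already selected.
    candidates = sorted(set(corpus) - relevant_ids)
    extra_ids = rng.sample(candidates, min(n_irrelevant, len(candidates)))
    return {doc_id: corpus[doc_id] for doc_id in sorted(relevant_ids) + extra_ids}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
corpus = {str(i): f"sentence {i}" for i in range(10)}
train_queries = {"10": "train query 10", "11": "train query 11"}
train_qrels = {"10": {"0": 1}, "11": {"2": 1}}
dev_queries = {"20": "dev query 20", "21": "dev query 21"}
dev_qrels = {"20": {"1": 1}, "21": {"3": 1}}

train_qrels_sample, train_queries_sample = sample_split(train_qrels, train_queries, 1, rng)
dev_qrels_sample, dev_queries_sample = sample_split(dev_qrels, dev_queries, 1, rng)
corpus_sample = sample_corpus(corpus, train_qrels_sample, dev_qrels_sample, 3, rng)
```

With n_relevant=1 and n_irrelevant=3, the toy corpus sample holds 2 * 1 + 3 = 5 documents, including the relevant document of each sampled query.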

Evaluate query models



evaluate_query_models

 evaluate_query_models (app_package:vespa.package.ApplicationPackage,
                        query_models:List[learntorank.query.QueryModel],
                        metrics:List[learntorank.evaluation.EvalMetric],
                        corpus_size:List[int], output_file_path:str,
                        dev_query_percentage:float=0.006285807802305023,
                        verbose:bool=True, **kwargs)
from learntorank.evaluation import (
    MatchRatio,
    Recall, 
    ReciprocalRank, 
    NormalizedDiscountedCumulativeGain
)
from learntorank.query import QueryModel, OR, Ranking

corpus_size = [100, 200]
app_package = create_basic_search_package(name="PassageEvaluationApp")
query_models = [
    QueryModel(
        name="bm25", 
        match_phase=OR(), 
        ranking=Ranking(name="bm25")
    ),
    QueryModel(
        name="native_rank", 
        match_phase=OR(), 
        ranking=Ranking(name="native_rank")
    )
]
metrics = [
    MatchRatio(),
    Recall(at=100), 
    ReciprocalRank(at=10), 
    NormalizedDiscountedCumulativeGain(at=10)
]
output_file_path = "test.csv"
estimates = evaluate_query_models(
    app_package=app_package,
    query_models=query_models,
    metrics=metrics,
    corpus_size=corpus_size,
    dev_query_percentage=0.5,
    output_file_path=output_file_path, 
    verbose=False
)
*****
Deploy Vespa application:
*****
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for application status, 0/300 seconds...
Waiting for application status, 5/300 seconds...
Waiting for application status, 10/300 seconds...
Waiting for application status, 15/300 seconds...
Waiting for application status, 20/300 seconds...
Waiting for application status, 25/300 seconds...
Waiting for application status, 30/300 seconds...
Waiting for application status, 35/300 seconds...
Waiting for application status, 40/300 seconds...
Waiting for application status, 45/300 seconds...
Waiting for application status, 50/300 seconds...
Waiting for application status, 55/300 seconds...
Waiting for application status, 60/300 seconds...
Waiting for application status, 65/300 seconds...
Waiting for application status, 70/300 seconds...
Waiting for application status, 75/300 seconds...
Waiting for application status, 80/300 seconds...