= {"a": 1, "b":2, "c":3} d
passage
Data manipulation
Code related to the manipulation of passage ranking data.
sample_dict_items
sample_dict_items (d:Dict, n:int)
Sample items from a dict.
|  | Type | Details |
|---|---|---|
| d | typing.Dict | dict to be sampled from. |
| n | int | Number of samples. |
| Returns | typing.Dict | dict with sampled values. |
Usage:
d = {"a": 1, "b": 2, "c": 3}

sample_dict_items(d, n=1)
{'a': 1}

sample_dict_items(d, n=2)
{'c': 3, 'b': 2}

sample_dict_items(d, n=3)
{'a': 1, 'c': 3, 'b': 2}
The full dict is returned in case the number of samples is higher than the length of the dict:
sample_dict_items(d, n=4)
{'a': 1, 'c': 3, 'b': 2}
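For reference, the behaviour illustrated above could be implemented roughly as follows. This is a minimal sketch, not the library's actual source; the function name is hypothetical and it simply caps n at the dict size before sampling keys with random.sample:

```python
import random
from typing import Dict


def sample_dict_items_sketch(d: Dict, n: int) -> Dict:
    # Cap n at the dict size so oversized requests return the full dict.
    n = min(n, len(d))
    # Randomly pick n keys and rebuild a dict from them.
    sampled_keys = random.sample(list(d.keys()), n)
    return {k: d[k] for k in sampled_keys}
```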
save_data
save_data (corpus:Dict, train_qrels:Dict, train_queries:Dict, dev_qrels:Dict, dev_queries:Dict, file_path:str='passage_sample.json')
Save data to disk.
The main goal is to save sample data to disk.
|  | Type | Default | Details |
|---|---|---|---|
| corpus | typing.Dict |  | Document corpus, see usage example below. |
| train_qrels | typing.Dict |  | Training relevance scores, see usage example below. |
| train_queries | typing.Dict |  | Training queries, see usage example below. |
| dev_qrels | typing.Dict |  | Development relevance scores, see usage example below. |
| dev_queries | typing.Dict |  | Development queries, see usage example below. |
| file_path | str | passage_sample.json | Valid JSON file path. |
| Returns | None |  | Side effect: data is saved to file_path. |
Usage:
corpus = {
    "0": "sentence 0",
    "1": "sentence 1",
    "2": "sentence 2",
    "3": "sentence 3"
}
train_queries = {
    "10": "train query 10",
    "11": "train query 11"
}
train_qrels = {
    "10": {"0": 1},
    "11": {"2": 1}
}
dev_queries = {
    "20": "train query 20",
    "21": "train query 21"
}
dev_qrels = {
    "20": {"1": 1},
    "21": {"3": 1}
}

save_data(
    corpus,
    train_qrels, train_queries,
    dev_qrels, dev_queries,
    file_path="passage_sample.json"
)
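Judging by the structure that load_data returns below, save_data presumably bundles the five dicts under fixed keys and serializes them as JSON. A minimal sketch under that assumption (the helper name is hypothetical, not the library's source):

```python
import json
from typing import Dict


def save_data_sketch(corpus: Dict, train_qrels: Dict, train_queries: Dict,
                     dev_qrels: Dict, dev_queries: Dict,
                     file_path: str = "passage_sample.json") -> None:
    # Bundle the components under the keys that load_data reports.
    data = {
        "corpus": corpus,
        "train_qrels": train_qrels,
        "train_queries": train_queries,
        "dev_qrels": dev_qrels,
        "dev_queries": dev_queries,
    }
    # Side effect: write the bundle to file_path as JSON.
    with open(file_path, "w") as f:
        json.dump(data, f)
```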
load_data
load_data (file_path:Optional[str]=None)
Load data.
The main goal is to load sample data from disk. If a file_path
is not provided, a pre-generated data sample will be downloaded.
|  | Type | Default | Details |
|---|---|---|---|
| file_path | typing.Optional[str] | None | Valid JSON file path containing data saved by save_data. If None, a pre-generated sample will be downloaded. |
| Returns | typing.Dict |  | See usage example below for expected format. |
Usage:
- With file_path:

data = load_data("passage_sample.json")
data
{'corpus': {'0': 'sentence 0',
'1': 'sentence 1',
'2': 'sentence 2',
'3': 'sentence 3'},
'train_qrels': {'10': {'0': 1}, '11': {'2': 1}},
'train_queries': {'10': 'train query 10', '11': 'train query 11'},
'dev_qrels': {'20': {'1': 1}, '21': {'3': 1}},
'dev_queries': {'20': 'train query 20', '21': 'train query 21'}}
- Without file_path specified, a pre-generated data sample will be downloaded:

data = load_data()
data.keys()
dict_keys(['corpus', 'train_qrels', 'train_queries', 'dev_qrels', 'dev_queries'])
len(data["corpus"])
1000
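The file-based branch of load_data amounts to reading that JSON bundle back. A sketch of just that branch (the download path used when file_path is None is omitted, and the helper name is hypothetical):

```python
import json
from typing import Dict


def load_data_from_file_sketch(file_path: str) -> Dict:
    # Read a JSON bundle previously written by save_data.
    with open(file_path) as f:
        return json.load(f)
```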
PassageData
PassageData (corpus:Optional[Dict]=None, train_qrels:Optional[Dict]=None, train_queries:Optional[Dict]=None, dev_qrels:Optional[Dict]=None, dev_queries:Optional[Dict]=None)
Container for passage data
|  | Type | Default | Details |
|---|---|---|---|
| corpus | typing.Optional[typing.Dict] | None | Document corpus, see usage example below. |
| train_qrels | typing.Optional[typing.Dict] | None | Training relevance scores, see usage example below. |
| train_queries | typing.Optional[typing.Dict] | None | Training queries, see usage example below. |
| dev_qrels | typing.Optional[typing.Dict] | None | Development relevance scores, see usage example below. |
| dev_queries | typing.Optional[typing.Dict] | None | Development queries, see usage example below. |
Usage:
corpus = {
    "0": "sentence 0",
    "1": "sentence 1",
    "2": "sentence 2",
    "3": "sentence 3"
}
train_queries = {
    "10": "train query 10",
    "11": "train query 11"
}
train_qrels = {
    "10": {"0": 1},
    "11": {"2": 1}
}
dev_queries = {
    "20": "train query 20",
    "21": "train query 21"
}
dev_qrels = {
    "20": {"1": 1},
    "21": {"3": 1}
}

passage_data = PassageData(
    corpus=corpus,
    train_queries=train_queries,
    train_qrels=train_qrels,
    dev_queries=dev_queries,
    dev_qrels=dev_qrels
)
passage_data
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)
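Since the keys returned by load_data line up with the constructor parameters, the container can presumably also be built straight from a loaded sample; a one-line sketch under that assumption:

```python
# Assumes the dict keys returned by load_data match PassageData's keyword arguments.
passage_data = PassageData(**load_data("passage_sample.json"))
```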
PassageData.save
PassageData.save (file_path:str='passage_sample.json')
Save the passage data to disk as a JSON file.
passage_data.save()
PassageData.load
PassageData.load (file_path:Optional[str]=None)
Load passage data from disk.
|  | Type | Default | Details |
|---|---|---|---|
| file_path | typing.Optional[str] | None | Valid JSON file path containing data saved by save_data. If None, a pre-generated sample will be downloaded. |
| Returns | PassageData |  |  |
data = PassageData.load(file_path="passage_sample.json")
data
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)
PassageData.summary
PassageData.summary ()
Summary of the size of the dataset components.
data.summary
Number of documents: 4
Number of train queries: 2
Number of train relevance judgments: 2
Number of dev queries: 2
Number of dev relevance judgments: 2
PassageData.get_corpus
PassageData.get_corpus ()
passage_data.get_corpus()
|  | doc_id | text |
|---|---|---|
| 0 | 0 | sentence 0 |
| 1 | 1 | sentence 1 |
| 2 | 2 | sentence 2 |
| 3 | 3 | sentence 3 |
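The DataFrame above can be reproduced from the corpus dict with plain pandas; a sketch of what get_corpus presumably does internally (the helper name is hypothetical):

```python
import pandas as pd
from typing import Dict


def corpus_to_dataframe_sketch(corpus: Dict) -> pd.DataFrame:
    # Turn the {doc_id: text} mapping into a two-column DataFrame.
    return pd.DataFrame(list(corpus.items()), columns=["doc_id", "text"])
```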
PassageData.get_queries
PassageData.get_queries (type:str)
Get query data.
|  | Type | Details |
|---|---|---|
| type | str | Either ‘train’ or ‘dev’. |
| Returns | DataFrame | DataFrame containing ‘query_id’ and ‘query’. |
type="train") passage_data.get_queries(
query_id | query | |
---|---|---|
0 | 10 | train query 10 |
1 | 11 | train query 11 |
type="dev") passage_data.get_queries(
query_id | query | |
---|---|---|
0 | 20 | train query 20 |
1 | 21 | train query 21 |
PassageData.get_labels
PassageData.get_labels (type:str)
Get labeled data
|  | Type | Details |
|---|---|---|
| type | str | Either ‘train’ or ‘dev’. |
| Returns | typing.Dict | pyvespa-formatted labeled data. |
type="train") passage_data.get_labels(
[{'query_id': '10',
'query': 'train query 10',
'relevant_docs': [{'id': '0', 'score': 1}]},
{'query_id': '11',
'query': 'train query 11',
'relevant_docs': [{'id': '2', 'score': 1}]}]
type="dev") passage_data.get_labels(
[{'query_id': '20',
'query': 'train query 20',
'relevant_docs': [{'id': '1', 'score': 1}]},
{'query_id': '21',
'query': 'train query 21',
'relevant_docs': [{'id': '3', 'score': 1}]}]
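The pyvespa-formatted labels above combine the queries and qrels of the chosen split. A sketch of that transformation, assuming each qrels entry maps a query id to a {doc_id: score} dict (the helper name is hypothetical):

```python
from typing import Dict, List


def build_labels_sketch(queries: Dict, qrels: Dict) -> List[Dict]:
    # One entry per query, listing its judged documents and relevance scores.
    return [
        {
            "query_id": query_id,
            "query": query_text,
            "relevant_docs": [
                {"id": doc_id, "score": score}
                for doc_id, score in qrels.get(query_id, {}).items()
            ],
        }
        for query_id, query_text in queries.items()
    ]
```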
sample_data
sample_data (n_relevant:int, n_irrelevant:int)
Sample data from the passage ranking dataset.
The final sample contains n_relevant train relevant documents, n_relevant dev relevant documents, and n_irrelevant random documents sampled from the entire corpus.
All the relevant sampled documents, from both the train and dev sets, are guaranteed to be in the corpus_sample, which will contain 2 * n_relevant + n_irrelevant documents.
|  | Type | Details |
|---|---|---|
| n_relevant | int | The number of relevant documents to sample. |
| n_irrelevant | int | The number of non-judged documents to sample. |
| Returns | PassageData |  |
Usage:
sample = sample_data(n_relevant=1, n_irrelevant=3)
The sampled corpus is a dict containing document id as key and the passage text as value.
sample.corpus
{'890370': 'the map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albaniahe map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albania',
'5060205': 'Setting custom HTTP headers with cURL can be done by using the CURLOPT_HTTPHEADER option, which can be set with the curl_setopt function. To add headers to your HTTP request you need to put them into a PHP Array, which you can then pass to the cul_setopt function, like demonstrated in the below example.',
'6096573': "The sugar in RNA is ribose, whereas the sugar in DNA is deoxyribose. The only difference between the two is that in deoxyribose, there is an oxygen missing from the 2' carbon …(there is a H there instead of an OH). This makes DNA more stable/less reactive than RNA. 1 person found this useful.",
'3092885': 'All three C-Ph bonds are typical of sp 3 - sp 2 carbon-carbon bonds with lengths of approximately 1.47 A, å while The-C o bond length is approximately.1 42. A å the presence of three adjacent phenyl groups confers special properties manifested in the reactivity of. the alcohol',
'7275560': 'shortest phase of mitosis Anaphase is the shortest phase of mitosis. During anaphase the arranged chromosomes at the metaphase plate are migrate towards their respective poles. Before this migration started, chromosomes are divided into sister chromatids, by the separation of joined centromere of two sister chromatids of a chromosomes.'}
The size of the sampled document corpus is equal to 2 * n_relevant + n_irrelevant.
len(sample.corpus)
5
Sampled queries are dicts with the query id as key and the query text as value.
print(sample.train_queries)
print(sample.dev_queries)
{'899723': 'what sugar is found in rna'}
{'994205': 'which is the shortest stage in duration'}
Sampled qrels contain one relevant document for each query.
print(sample.train_qrels)
print(sample.dev_qrels)
{'899723': {'6096573': 1}}
{'994205': {'7275560': 1}}
The following relevant documents are guaranteed to be included in the corpus_sample.
['6096573', '7275560']
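The sampling logic described at the start of this section can be sketched as below. This is an illustration of the guarantees (judged documents always end up in the corpus sample), not the library's actual implementation; the full-dataset arguments are an assumption, since sample_data itself only takes n_relevant and n_irrelevant and fetches the passage ranking dataset internally.

```python
import random
from typing import Dict


def sample_data_sketch(corpus: Dict, train_queries: Dict, train_qrels: Dict,
                       dev_queries: Dict, dev_qrels: Dict,
                       n_relevant: int, n_irrelevant: int) -> PassageData:
    # Sample n_relevant judged queries from each of the train and dev splits.
    train_qrels_sample = sample_dict_items(train_qrels, n_relevant)
    dev_qrels_sample = sample_dict_items(dev_qrels, n_relevant)
    train_queries_sample = {qid: train_queries[qid] for qid in train_qrels_sample}
    dev_queries_sample = {qid: dev_queries[qid] for qid in dev_qrels_sample}
    # Every judged document goes into the corpus sample ...
    relevant_ids = {
        doc_id
        for qrels in (train_qrels_sample, dev_qrels_sample)
        for judgments in qrels.values()
        for doc_id in judgments
    }
    # ... plus n_irrelevant documents drawn from the rest of the corpus.
    other_ids = random.sample(
        [doc_id for doc_id in corpus if doc_id not in relevant_ids], n_irrelevant
    )
    corpus_sample = {doc_id: corpus[doc_id] for doc_id in relevant_ids | set(other_ids)}
    return PassageData(
        corpus=corpus_sample,
        train_qrels=train_qrels_sample,
        train_queries=train_queries_sample,
        dev_qrels=dev_qrels_sample,
        dev_queries=dev_queries_sample,
    )
```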
Basic search
Code related to a basic search engine for passage ranking.
create_basic_search_package
create_basic_search_package (name:str='PassageRanking')
Create a basic Vespa application package for passage ranking.
Vespa fields:
The application contains two string fields: doc_id and text.
Vespa rank functions:
The application contains two rank profiles: bm25 and native_rank.
|  | Type | Default | Details |
|---|---|---|---|
| name | str | PassageRanking | Name of the application. |
| Returns | ApplicationPackage |  | pyvespa ApplicationPackage instance. |
Usage:
app_package = create_basic_search_package(name="PassageModuleApp")

Check what the Vespa schema definition for this application looks like:
print(app_package.schema.schema_to_text)
schema PassageModuleApp {
document PassageModuleApp {
field doc_id type string {
indexing: attribute | summary
}
field text type string {
indexing: index | summary
index: enable-bm25
}
}
fieldset default {
fields: text
}
rank-profile bm25 {
first-phase {
expression: bm25(text)
}
summary-features {
bm25(text)
}
}
rank-profile native_rank {
first-phase {
expression: nativeRank(text)
}
}
}
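For comparison, a schema like the one printed above could be assembled with pyvespa's package API roughly as follows. This is a sketch of equivalent building blocks, not create_basic_search_package's actual source; the summary-features block of the bm25 profile is omitted for brevity.

```python
from vespa.package import ApplicationPackage, Field, FieldSet, RankProfile


def build_passage_package_sketch(name: str = "PassageRanking") -> ApplicationPackage:
    app_package = ApplicationPackage(name=name)
    # Two string fields: doc_id as an attribute, text indexed with BM25 enabled.
    app_package.schema.add_fields(
        Field(name="doc_id", type="string", indexing=["attribute", "summary"]),
        Field(name="text", type="string", indexing=["index", "summary"],
              index="enable-bm25"),
    )
    # Default fieldset so queries search the text field.
    app_package.schema.add_field_set(FieldSet(name="default", fields=["text"]))
    # Two rank profiles mirroring the schema above.
    app_package.schema.add_rank_profile(
        RankProfile(name="bm25", first_phase="bm25(text)")
    )
    app_package.schema.add_rank_profile(
        RankProfile(name="native_rank", first_phase="nativeRank(text)")
    )
    return app_package
```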
Evaluate query models
evaluate_query_models
evaluate_query_models (app_package:vespa.package.ApplicationPackage, query_models:List[learntorank.query.QueryModel], metrics:List[learntorank.evaluation.EvalMetric], corpus_size:List[int], output_file_path:str, dev_query_percentage:float=0.006285807802305023, verbose:bool=True, **kwargs)
from learntorank.evaluation import (
    MatchRatio,
    Recall,
    ReciprocalRank,
    NormalizedDiscountedCumulativeGain
)
from learntorank.query import QueryModel, OR, Ranking

corpus_size = [100, 200]
app_package = create_basic_search_package(name="PassageEvaluationApp")
query_models = [
    QueryModel(
        name="bm25",
        match_phase=OR(),
        ranking=Ranking(name="bm25")
    ),
    QueryModel(
        name="native_rank",
        match_phase=OR(),
        ranking=Ranking(name="native_rank")
    )
]
metrics = [
    MatchRatio(),
    Recall(at=100),
    ReciprocalRank(at=10),
    NormalizedDiscountedCumulativeGain(at=10)
]
output_file_path = "test.csv"
estimates = evaluate_query_models(
    app_package=app_package,
    query_models=query_models,
    metrics=metrics,
    corpus_size=corpus_size,
    dev_query_percentage=0.5,
    output_file_path=output_file_path,
    verbose=False
)
*****
Deploy Vespa application:
*****
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for application status, 0/300 seconds...
...
Waiting for application status, 80/300 seconds...