= {"a": 1, "b":2, "c":3} d
passage
Data manipulation
Code related to the manipulation of passage ranking data.
sample_dict_items
sample_dict_items (d:Dict, n:int)
Sample items from a dict.
|  | Type | Details |
|---|---|---|
| d | typing.Dict | dict to be sampled from. |
| n | int | Number of samples. |
| Returns | typing.Dict | dict with sampled values. |
Usage:
d = {"a": 1, "b": 2, "c": 3}

sample_dict_items(d, n=1)
{'a': 1}

sample_dict_items(d, n=2)
{'c': 3, 'b': 2}

sample_dict_items(d, n=3)
{'a': 1, 'c': 3, 'b': 2}
The full dict is returned in case the number of samples is higher than the length of the dict:
sample_dict_items(d, n=4)
{'a': 1, 'c': 3, 'b': 2}
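For reference, the behaviour illustrated above could be implemented roughly as follows. This is a minimal sketch, not the library's actual source; the function name is hypothetical and it simply caps n at the dict size before sampling keys with random.sample:

```python
import random
from typing import Dict


def sample_dict_items_sketch(d: Dict, n: int) -> Dict:
    # Cap n at the dict size so oversized requests return the full dict.
    n = min(n, len(d))
    # Randomly pick n keys and rebuild a dict from them.
    sampled_keys = random.sample(list(d.keys()), n)
    return {k: d[k] for k in sampled_keys}
```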
save_data
save_data (corpus:Dict, train_qrels:Dict, train_queries:Dict, dev_qrels:Dict, dev_queries:Dict, file_path:str='passage_sample.json')
Save data to disk.
The main goal is to save sample data to disk.
|  | Type | Default | Details |
|---|---|---|---|
| corpus | typing.Dict |  | Document corpus, see usage example below. |
| train_qrels | typing.Dict |  | Training relevance scores, see usage example below. |
| train_queries | typing.Dict |  | Training queries, see usage example below. |
| dev_qrels | typing.Dict |  | Development relevance scores, see usage example below. |
| dev_queries | typing.Dict |  | Development queries, see usage example below. |
| file_path | str | passage_sample.json | Valid JSON file path. |
| Returns | None |  | Side effect: data is saved to file_path. |
Usage:
corpus = {
    "0": "sentence 0",
    "1": "sentence 1",
    "2": "sentence 2",
    "3": "sentence 3"
}
train_queries = {
    "10": "train query 10",
    "11": "train query 11"
}
train_qrels = {
    "10": {"0": 1},
    "11": {"2": 1}
}
dev_queries = {
    "20": "train query 20",
    "21": "train query 21"
}
dev_qrels = {
    "20": {"1": 1},
    "21": {"3": 1}
}

save_data(
    corpus,
    train_qrels, train_queries,
    dev_qrels, dev_queries,
    file_path="passage_sample.json"
)
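Judging by the structure that load_data returns below, save_data presumably bundles the five dicts under fixed keys and serializes them as JSON. A minimal sketch under that assumption (the helper name is hypothetical, not the library's source):

```python
import json
from typing import Dict


def save_data_sketch(corpus: Dict, train_qrels: Dict, train_queries: Dict,
                     dev_qrels: Dict, dev_queries: Dict,
                     file_path: str = "passage_sample.json") -> None:
    # Bundle the components under the keys that load_data reports.
    data = {
        "corpus": corpus,
        "train_qrels": train_qrels,
        "train_queries": train_queries,
        "dev_qrels": dev_qrels,
        "dev_queries": dev_queries,
    }
    # Side effect: write the bundle to file_path as JSON.
    with open(file_path, "w") as f:
        json.dump(data, f)
```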
load_data
load_data (file_path:Optional[str]=None)
Load data.
The main goal is to load sample data from disk. If a file_path
is not provided, a pre-generated data sample will be downloaded.
|  | Type | Default | Details |
|---|---|---|---|
| file_path | typing.Optional[str] | None | Valid JSON file path containing data saved by save_data. If None, a pre-generated sample will be downloaded. |
| Returns | typing.Dict |  | See usage example below for expected format. |
Usage:
- With file_path:

data = load_data("passage_sample.json")
data
{'corpus': {'0': 'sentence 0',
'1': 'sentence 1',
'2': 'sentence 2',
'3': 'sentence 3'},
'train_qrels': {'10': {'0': 1}, '11': {'2': 1}},
'train_queries': {'10': 'train query 10', '11': 'train query 11'},
'dev_qrels': {'20': {'1': 1}, '21': {'3': 1}},
'dev_queries': {'20': 'train query 20', '21': 'train query 21'}}
- Without file_path specified, a pre-generated data sample will be downloaded:

data = load_data()
data.keys()
dict_keys(['corpus', 'train_qrels', 'train_queries', 'dev_qrels', 'dev_queries'])
len(data["corpus"])
1000
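The file-based branch of load_data amounts to reading that JSON bundle back. A sketch of just that branch (the download path used when file_path is None is omitted, and the helper name is hypothetical):

```python
import json
from typing import Dict


def load_data_from_file_sketch(file_path: str) -> Dict:
    # Read a JSON bundle previously written by save_data.
    with open(file_path) as f:
        return json.load(f)
```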
PassageData
PassageData (corpus:Optional[Dict]=None, train_qrels:Optional[Dict]=None, train_queries:Optional[Dict]=None, dev_qrels:Optional[Dict]=None, dev_queries:Optional[Dict]=None)
Container for passage data
|  | Type | Default | Details |
|---|---|---|---|
| corpus | typing.Optional[typing.Dict] | None | Document corpus, see usage example below. |
| train_qrels | typing.Optional[typing.Dict] | None | Training relevance scores, see usage example below. |
| train_queries | typing.Optional[typing.Dict] | None | Training queries, see usage example below. |
| dev_qrels | typing.Optional[typing.Dict] | None | Development relevance scores, see usage example below. |
| dev_queries | typing.Optional[typing.Dict] | None | Development queries, see usage example below. |
Usage:
corpus = {
    "0": "sentence 0",
    "1": "sentence 1",
    "2": "sentence 2",
    "3": "sentence 3"
}
train_queries = {
    "10": "train query 10",
    "11": "train query 11"
}
train_qrels = {
    "10": {"0": 1},
    "11": {"2": 1}
}
dev_queries = {
    "20": "train query 20",
    "21": "train query 21"
}
dev_qrels = {
    "20": {"1": 1},
    "21": {"3": 1}
}

passage_data = PassageData(
    corpus=corpus,
    train_queries=train_queries,
    train_qrels=train_qrels,
    dev_queries=dev_queries,
    dev_qrels=dev_qrels
)
passage_data
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)
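Since the keys returned by load_data line up with the constructor parameters, the container can presumably also be built straight from a loaded sample; a one-line sketch under that assumption:

```python
# Assumes the dict keys returned by load_data match PassageData's keyword arguments.
passage_data = PassageData(**load_data("passage_sample.json"))
```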
PassageData.save
PassageData.save (file_path:str='passage_sample.json')
Save the passage data to disk as a JSON file.
passage_data.save()
PassageData.load
PassageData.load (file_path:Optional[str]=None)
Load passage data from disk.
|  | Type | Default | Details |
|---|---|---|---|
| file_path | typing.Optional[str] | None | Valid JSON file path containing data saved by save_data. If None, a pre-generated sample will be downloaded. |
| Returns | PassageData |  |  |
data = PassageData.load(file_path="passage_sample.json")
data
PassageData(corpus, train_qrels, train_queries, dev_qrels, dev_queries)
PassageData.summary
PassageData.summary ()
Summary of the size of the dataset components.
data.summary
Number of documents: 4
Number of train queries: 2
Number of train relevance judgments: 2
Number of dev queries: 2
Number of dev relevance judgments: 2
PassageData.get_corpus
PassageData.get_corpus ()
passage_data.get_corpus()
|  | doc_id | text |
|---|---|---|
| 0 | 0 | sentence 0 |
| 1 | 1 | sentence 1 |
| 2 | 2 | sentence 2 |
| 3 | 3 | sentence 3 |
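The DataFrame above can be reproduced from the corpus dict with plain pandas; a sketch of what get_corpus presumably does internally (the helper name is hypothetical):

```python
import pandas as pd
from typing import Dict


def corpus_to_dataframe_sketch(corpus: Dict) -> pd.DataFrame:
    # Turn the {doc_id: text} mapping into a two-column DataFrame.
    return pd.DataFrame(list(corpus.items()), columns=["doc_id", "text"])
```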
PassageData.get_queries
PassageData.get_queries (type:str)
Get query data.
|  | Type | Details |
|---|---|---|
| type | str | Either ‘train’ or ‘dev’. |
| Returns | DataFrame | DataFrame containing ‘query_id’ and ‘query’. |
type="train") passage_data.get_queries(
query_id | query | |
---|---|---|
0 | 10 | train query 10 |
1 | 11 | train query 11 |
type="dev") passage_data.get_queries(
query_id | query | |
---|---|---|
0 | 20 | train query 20 |
1 | 21 | train query 21 |
PassageData.get_labels
PassageData.get_labels (type:str)
Get labeled data
|  | Type | Details |
|---|---|---|
| type | str | Either ‘train’ or ‘dev’. |
| Returns | typing.Dict | pyvespa-formatted labeled data. |
type="train") passage_data.get_labels(
[{'query_id': '10',
'query': 'train query 10',
'relevant_docs': [{'id': '0', 'score': 1}]},
{'query_id': '11',
'query': 'train query 11',
'relevant_docs': [{'id': '2', 'score': 1}]}]
type="dev") passage_data.get_labels(
[{'query_id': '20',
'query': 'train query 20',
'relevant_docs': [{'id': '1', 'score': 1}]},
{'query_id': '21',
'query': 'train query 21',
'relevant_docs': [{'id': '3', 'score': 1}]}]
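The pyvespa-formatted labels above combine the queries and qrels of the chosen split. A sketch of that transformation, assuming each qrels entry maps a query id to a {doc_id: score} dict (the helper name is hypothetical):

```python
from typing import Dict, List


def build_labels_sketch(queries: Dict, qrels: Dict) -> List[Dict]:
    # One entry per query, listing its judged documents and relevance scores.
    return [
        {
            "query_id": query_id,
            "query": query_text,
            "relevant_docs": [
                {"id": doc_id, "score": score}
                for doc_id, score in qrels.get(query_id, {}).items()
            ],
        }
        for query_id, query_text in queries.items()
    ]
```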
sample_data
sample_data (n_relevant:int, n_irrelevant:int)
Sample data from the passage ranking dataset.
The final sample contains n_relevant train relevant documents, n_relevant dev relevant documents, and n_irrelevant random documents sampled from the entire corpus.
All the relevant sampled documents, from both the train and dev sets, are guaranteed to be in the corpus_sample, which will contain 2 * n_relevant + n_irrelevant documents.
|  | Type | Details |
|---|---|---|
| n_relevant | int | The number of relevant documents to sample. |
| n_irrelevant | int | The number of non-judged documents to sample. |
| Returns | PassageData |  |
Usage:
sample = sample_data(n_relevant=1, n_irrelevant=3)
The sampled corpus is a dict containing document id as key and the passage text as value.
sample.corpus
{'890370': 'the map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albaniahe map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albania',
'5060205': 'Setting custom HTTP headers with cURL can be done by using the CURLOPT_HTTPHEADER option, which can be set with the curl_setopt function. To add headers to your HTTP request you need to put them into a PHP Array, which you can then pass to the cul_setopt function, like demonstrated in the below example.',
'6096573': "The sugar in RNA is ribose, whereas the sugar in DNA is deoxyribose. The only difference between the two is that in deoxyribose, there is an oxygen missing from the 2' carbon …(there is a H there instead of an OH). This makes DNA more stable/less reactive than RNA. 1 person found this useful.",
'3092885': 'All three C-Ph bonds are typical of sp 3 - sp 2 carbon-carbon bonds with lengths of approximately 1.47 A, å while The-C o bond length is approximately.1 42. A å the presence of three adjacent phenyl groups confers special properties manifested in the reactivity of. the alcohol',
'7275560': 'shortest phase of mitosis Anaphase is the shortest phase of mitosis. During anaphase the arranged chromosomes at the metaphase plate are migrate towards their respective poles. Before this migration started, chromosomes are divided into sister chromatids, by the separation of joined centromere of two sister chromatids of a chromosomes.'}
The size of the sampled document corpus is equal to 2 * n_relevant + n_irrelevant.
len(sample.corpus)
5
Sampled queries are dicts with the query id as key and the query text as value.
print(sample.train_queries)
print(sample.dev_queries)
{'899723': 'what sugar is found in rna'}
{'994205': 'which is the shortest stage in duration'}
Sampled qrels contain one relevant document for each query.
print(sample.train_qrels)
print(sample.dev_qrels)
{'899723': {'6096573': 1}}
{'994205': {'7275560': 1}}
The following relevant documents are guaranteed to be included in the corpus_sample.
['6096573', '7275560']
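The sampling logic described at the start of this section can be sketched as below. This is an illustration of the guarantees (judged documents always end up in the corpus sample), not the library's actual implementation; the full-dataset arguments are an assumption, since sample_data itself only takes n_relevant and n_irrelevant and fetches the passage ranking dataset internally.

```python
import random
from typing import Dict


def sample_data_sketch(corpus: Dict, train_queries: Dict, train_qrels: Dict,
                       dev_queries: Dict, dev_qrels: Dict,
                       n_relevant: int, n_irrelevant: int) -> PassageData:
    # Sample n_relevant judged queries from each of the train and dev splits.
    train_qrels_sample = sample_dict_items(train_qrels, n_relevant)
    dev_qrels_sample = sample_dict_items(dev_qrels, n_relevant)
    train_queries_sample = {qid: train_queries[qid] for qid in train_qrels_sample}
    dev_queries_sample = {qid: dev_queries[qid] for qid in dev_qrels_sample}
    # Every judged document goes into the corpus sample ...
    relevant_ids = {
        doc_id
        for qrels in (train_qrels_sample, dev_qrels_sample)
        for judgments in qrels.values()
        for doc_id in judgments
    }
    # ... plus n_irrelevant documents drawn from the rest of the corpus.
    other_ids = random.sample(
        [doc_id for doc_id in corpus if doc_id not in relevant_ids], n_irrelevant
    )
    corpus_sample = {doc_id: corpus[doc_id] for doc_id in relevant_ids | set(other_ids)}
    return PassageData(
        corpus=corpus_sample,
        train_qrels=train_qrels_sample,
        train_queries=train_queries_sample,
        dev_qrels=dev_qrels_sample,
        dev_queries=dev_queries_sample,
    )
```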
Basic search
Code related to a basic search engine for passage ranking.
create_basic_search_package
create_basic_search_package (name:str='PassageRanking')
Create a basic Vespa application package for passage ranking.
Vespa fields:
The application contains two string fields: doc_id and text.
Vespa rank functions:
The application contains two rank profiles: bm25 and native_rank.
|  | Type | Default | Details |
|---|---|---|---|
| name | str | PassageRanking | Name of the application. |
| Returns | ApplicationPackage |  | pyvespa ApplicationPackage instance. |
Usage:
app_package = create_basic_search_package(name="PassageModuleApp")

Check what the Vespa schema definition for this application looks like:
print(app_package.schema.schema_to_text)
schema PassageModuleApp {
document PassageModuleApp {
field doc_id type string {
indexing: attribute | summary
}
field text type string {
indexing: index | summary
index: enable-bm25
}
}
fieldset default {
fields: text
}
rank-profile bm25 {
first-phase {
expression: bm25(text)
}
summary-features {
bm25(text)
}
}
rank-profile native_rank {
first-phase {
expression: nativeRank(text)
}
}
}
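For comparison, a schema like the one printed above could be assembled with pyvespa's package API roughly as follows. This is a sketch of equivalent building blocks, not create_basic_search_package's actual source; the summary-features block of the bm25 profile is omitted for brevity.

```python
from vespa.package import ApplicationPackage, Field, FieldSet, RankProfile


def build_passage_package_sketch(name: str = "PassageRanking") -> ApplicationPackage:
    app_package = ApplicationPackage(name=name)
    # Two string fields: doc_id as an attribute, text indexed with BM25 enabled.
    app_package.schema.add_fields(
        Field(name="doc_id", type="string", indexing=["attribute", "summary"]),
        Field(name="text", type="string", indexing=["index", "summary"],
              index="enable-bm25"),
    )
    # Default fieldset so queries search the text field.
    app_package.schema.add_field_set(FieldSet(name="default", fields=["text"]))
    # Two rank profiles mirroring the schema above.
    app_package.schema.add_rank_profile(
        RankProfile(name="bm25", first_phase="bm25(text)")
    )
    app_package.schema.add_rank_profile(
        RankProfile(name="native_rank", first_phase="nativeRank(text)")
    )
    return app_package
```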
Evaluate query models
evaluate_query_models
evaluate_query_models (app_package:vespa.package.ApplicationPackage, query_models:List[learntorank.query.QueryModel], metrics:List[learntorank.evaluation.EvalMetric], corpus_size:List[int], output_file_path:str, dev_query_percentage:float=0.006285807802305023, verbose:bool=True, **kwargs)
from learntorank.evaluation import (
    MatchRatio,
    Recall,
    ReciprocalRank,
    NormalizedDiscountedCumulativeGain
)
from learntorank.query import QueryModel, OR, Ranking

corpus_size = [100, 200]
app_package = create_basic_search_package(name="PassageEvaluationApp")
query_models = [
    QueryModel(
        name="bm25",
        match_phase=OR(),
        ranking=Ranking(name="bm25")
    ),
    QueryModel(
        name="native_rank",
        match_phase=OR(),
        ranking=Ranking(name="native_rank")
    )
]
metrics = [
    MatchRatio(),
    Recall(at=100),
    ReciprocalRank(at=10),
    NormalizedDiscountedCumulativeGain(at=10)
]
output_file_path = "test.csv"
estimates = evaluate_query_models(
    app_package=app_package,
    query_models=query_models,
    metrics=metrics,
    corpus_size=corpus_size,
    dev_query_percentage=0.5,
    output_file_path=output_file_path,
    verbose=False
)
*****
Deploy Vespa application:
*****
Waiting for configuration server, 0/300 seconds...
Waiting for configuration server, 5/300 seconds...
Waiting for configuration server, 10/300 seconds...
Waiting for application status, 0/300 seconds...
...
Waiting for application status, 80/300 seconds...