and_filter = AND()
query
Match Filters
MatchFilter
MatchFilter ()
Abstract class for match filters.
AND
AND ()
Filter that matches documents containing all the query terms.
Usage: The AND filter is usually used when specifying query models.
OR
OR ()
Filter that matches any document containing at least one query term.
Usage: The OR filter is usually used when specifying query models.
or_filter = OR()
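The difference between the two filters can be illustrated with a toy matcher in plain Python. This is only a sketch of the matching semantics described above, not the library's implementation; the whitespace tokenizer and the `matches_and` and `matches_or` helpers are assumptions made for the example.

```python
# Toy illustration of AND vs. OR matching; not how Vespa evaluates queries.
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def matches_and(query, document):
    """True when the document contains every query term."""
    return tokenize(query) <= tokenize(document)


def matches_or(query, document):
    """True when the document contains at least one query term."""
    return bool(tokenize(query) & tokenize(document))


documents = {
    1: "this is a test",
    2: "another test document",
    3: "nothing in common",
}
query = "this test"

and_hits = [doc_id for doc_id, text in documents.items() if matches_and(query, text)]
or_hits = [doc_id for doc_id, text in documents.items() if matches_or(query, text)]
```

Here `and_hits` contains only document 1, while `or_hits` also includes document 2, which shares a single term with the query.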
WeakAnd
WeakAnd (hits:int, field:str='default')
Match documents according to the weakAND algorithm.
Reference: https://docs.vespa.ai/en/using-wand-with-vespa.html
Parameter | Type | Default | Details |
---|---|---|---|
hits | int | | Lower bound on the number of hits to be retrieved. |
field | str | default | Which Vespa field to search. |
Returns | None | | |
Usage: The WeakAnd filter is usually used when specifying query models.
weakand_filter = WeakAnd(hits=10, field="default")
Tokenize
Tokenize (hits:int, field:str='default')
Match documents according to the weakAND algorithm without parsing special characters.
Reference: https://docs.vespa.ai/en/reference/simple-query-language-reference.html
Parameter | Type | Default | Details |
---|---|---|---|
hits | int | | Lower bound on the number of hits to be retrieved. |
field | str | default | Which Vespa field to search. |
Returns | None | | |
Usage: The Tokenize filter is usually used when specifying query models.
tokenize_filter = Tokenize(hits=10, field="default")
ANN
ANN (doc_vector:str, query_vector:str, hits:int, label:str, approximate:bool=True)
Match documents according to the nearest neighbor operator.
Reference: https://docs.vespa.ai/en/reference/query-language-reference.html
Parameter | Type | Default | Details |
---|---|---|---|
doc_vector | str | | Name of the document field to be used in the distance calculation. |
query_vector | str | | Name of the query field to be used in the distance calculation. |
hits | int | | Lower bound on the number of hits to return. |
label | str | | A label to identify this specific operator instance. |
approximate | bool | True | True to use approximate nearest neighbor and False to use brute force. Defaults to True. |
Returns | None | | |
Usage: The ANN filter is usually used when specifying query models.
By default, the ANN operator uses approximate nearest neighbor:
match_filter = ANN(
    doc_vector="doc_vector",
    query_vector="query_vector",
    hits=10,
    label="label",
)
Brute-force can be used by specifying approximate=False:
ann_filter = ANN(
    doc_vector="doc_vector",
    query_vector="query_vector",
    hits=10,
    label="label",
    approximate=False,
)
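To make the brute-force option concrete, the sketch below scores every document vector against the query vector and keeps the closest ones. It is a plain-Python illustration, not Vespa's implementation; the Euclidean distance and the `brute_force_nearest` helper are assumptions for the example. Approximate search avoids this full scan by consulting an index, trading some exactness for speed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_nearest(query_vector, doc_vectors, hits):
    """Score every document and return the ids of the `hits` closest ones."""
    ranked = sorted(
        doc_vectors,
        key=lambda doc_id: euclidean(query_vector, doc_vectors[doc_id]),
    )
    return ranked[:hits]


# Hypothetical document vectors keyed by document id.
doc_vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
nearest = brute_force_nearest([0.9, 0.9], doc_vectors, hits=2)
```

The cost of this approach grows linearly with the number of documents, which is why it is usually reserved for small collections or for checking the recall of the approximate variant.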
Union
Union (*args:__main__.MatchFilter)
Match documents that belong to the union of many match filters.
Parameter | Type | Details |
---|---|---|
args | MatchFilter | Match filters to take the union of. |
Returns | None | |
Usage: The Union filter is usually used when specifying query models.
union_filter = Union(
    WeakAnd(hits=10, field="field_name"),
    ANN(
        doc_vector="doc_vector",
        query_vector="query_vector",
        hits=10,
        label="label",
    ),
)
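Conceptually, a union lets a document through when any of the combined filters matches it. The sketch below merges the hits of two hypothetical retrievers in plain Python; the id lists and the `union_of_hits` helper are invented for illustration and say nothing about how Vespa evaluates the combined query.

```python
def union_of_hits(*hit_lists):
    """Merge lists of document ids, dropping duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for hits in hit_lists:
        for doc_id in hits:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


weakand_hits = [3, 1, 7]  # hypothetical ids matched by a term-based filter
ann_hits = [7, 9, 3]      # hypothetical ids matched by a nearest-neighbor filter
candidates = union_of_hits(weakand_hits, ann_hits)
```

Combining a term-based filter with a nearest-neighbor filter in this way widens the candidate set that the rank profile then has to order.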
Ranking
Ranking
Ranking (name:str='default', list_features:bool=False)
Define the rank profile to be used during ranking.
Parameter | Type | Default | Details |
---|---|---|---|
name | str | default | Name of the rank profile as defined in a Vespa search definition. |
list_features | bool | False | Should the ranking features be returned. Either ‘true’ or ‘false’. |
Returns | None | | |
Usage: Ranking is usually used when specifying query models.
ranking = Ranking(name="bm25", list_features=True)
Query properties
QueryProperty
QueryProperty ()
Abstract class for query property.
QueryRankingFeature
QueryRankingFeature (name:str, mapping:Callable[[str],List[float]])
Include ranking.feature.query into a Vespa query.
Parameter | Type | Details |
---|---|---|
name | str | Name of the feature. |
mapping | typing.Callable[[str], typing.List[float]] | Function mapping a string to a list of floats. |
Returns | None | |
Usage: QueryRankingFeature is usually used when specifying query models.
query_property = QueryRankingFeature(
    name="query_vector", mapping=lambda x: [1, 2, 3]
)
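The `mapping` argument is what turns a query string into the list of floats sent with the request. The sketch below shows one way that step could look; `ranking_feature_entry` and `toy_embedding` are hypothetical helpers, and the `ranking.features.query(...)` key format is an assumption for illustration rather than a statement of what the class emits.

```python
from typing import Callable, Dict, List


def ranking_feature_entry(
    name: str, mapping: Callable[[str], List[float]], query: str
) -> Dict[str, List[float]]:
    """Hypothetical helper: apply `mapping` to the query and key the result by name."""
    # The key format below is an assumption made for this example only.
    return {f"ranking.features.query({name})": mapping(query)}


def toy_embedding(query: str) -> List[float]:
    """Stand-in for a real embedding model: one float per token (its length)."""
    return [float(len(token)) for token in query.split()]


entry = ranking_feature_entry("query_vector", toy_embedding, "this is a test")
```

In practice the mapping would usually wrap an embedding model, so that the same function can be applied to every query string.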
Query model
QueryModel
QueryModel (name:str='default_name', query_properties:Optional[List[__main__.QueryProperty]]=None, match_phase:__main__.MatchFilter=AND(), ranking:__main__.Ranking=Ranking(), body_function:Optional[Callable[[str],Dict]]=None)
Define a query model.
A QueryModel is an abstraction that encapsulates all the relevant information controlling how a Vespa app matches and ranks documents.
Parameter | Type | Default | Details |
---|---|---|---|
name | str | default_name | Name of the query model. Used to tag model-related quantities, like evaluation metrics. |
query_properties | typing.Optional[typing.List[main.QueryProperty]] | None | Query properties to be included in the queries. |
match_phase | MatchFilter | AND() | Define the match criteria. |
ranking | Ranking | Ranking() | Define the rank criteria. |
body_function | typing.Optional[typing.Callable[[str], typing.Dict]] | None | Function that takes a query as parameter and returns the body of a Vespa query. |
Returns | None | | |
Usage:
Specify a query model with default configurations:
query_model = QueryModel()
Specify match phase, ranking phase and properties used by them.
query_model = QueryModel(
    query_properties=[
        QueryRankingFeature(name="query_embedding", mapping=lambda x: [1, 2, 3])
    ],
    match_phase=ANN(
        doc_vector="document_embedding",
        query_vector="query_embedding",
        hits=10,
        label="label",
    ),
    ranking=Ranking(name="bm25_plus_embeddings", list_features=True),
)
Specify a query model based on a function that outputs the body of a Vespa query, including its YQL.
def body_function(query):
    body = {
        "yql": "select * from sources * where userQuery();",
        "query": query,
        "type": "any",
        "ranking": {"profile": "bm25", "listFeatures": "true"},
    }
    return body
query_model = QueryModel(body_function=body_function)
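Because a body function receives only the query string, any other setting has to be fixed when the function is defined. One way to do that is a small factory that closes over those settings, sketched below. `make_body_function` is a hypothetical helper, not part of the library, and the body keys simply mirror the example above with a `hits` entry added.

```python
from typing import Callable, Dict


def make_body_function(profile: str, hits: int = 10) -> Callable[[str], Dict]:
    """Hypothetical factory returning a body function bound to one rank profile."""

    def body_function(query: str) -> Dict:
        # Only `query` varies per request; profile and hits are closed over.
        return {
            "yql": "select * from sources * where userQuery();",
            "query": query,
            "type": "any",
            "ranking": {"profile": profile, "listFeatures": "true"},
            "hits": hits,
        }

    return body_function


bm25_body = make_body_function("bm25")("this is a test")
```

Each function returned by the factory has the single-argument shape that the `body_function` parameter expects.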
Send query with QueryModel
send_query
send_query (app:vespa.application.Vespa, body:Optional[Dict]=None, query:Optional[str]=None, query_model:Optional[__main__.QueryModel]=None, debug_request:bool=False, recall:Optional[Tuple]=None, **kwargs)
Send a query request to a Vespa application.
Either send ‘body’ containing all the request parameters or specify ‘query’ and ‘query_model’.
Parameter | Type | Default | Details |
---|---|---|---|
app | Vespa | | Connection to a Vespa application. |
body | typing.Optional[typing.Dict] | None | Contains all the request parameters. None when using query_model. |
query | typing.Optional[str] | None | Query string. None when using body. |
query_model | typing.Optional[main.QueryModel] | None | Query model. None when using body. |
debug_request | bool | False | Return request body for debugging instead of sending the request. |
recall | typing.Optional[typing.Tuple] | None | Tuple of size 2 where the first element is the name of the field to use to recall and the second element is a list of the values to be recalled. |
kwargs | | | |
Returns | VespaQueryResponse | | Either the request body if debug_request is True or the result from the Vespa application. |
Usage: Assume app is a Vespa connection.
Send request body.
body = {"yql": "select * from sources * where test"}
result = send_query(app=app, body=body)
Use query and query_model:
result = send_query(
    app=app,
    query="this is a test",
    query_model=QueryModel(
        match_phase=OR(),
        ranking=Ranking()
    ),
    hits=10,
)
Debug the output of the QueryModel by setting debug_request=True:
send_query(
    app=app,
    query="this is a test",
    query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
    debug_request=True,
    hits=10,
).request_body
{'yql': 'select * from sources * where ({grammar: "any"}userInput("this is a test"));',
'ranking': {'profile': 'default', 'listFeatures': 'false'},
'hits': 10}
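The request body printed above can be reproduced by hand, which helps when checking what a given combination of match filter, ranking and extra arguments should produce. The function below is a hypothetical re-creation for the `OR()` plus `Ranking()` case only, written to match the printed output; it is not the library's code and it does not escape quotes inside the query.

```python
def sketch_or_request_body(query, profile="default", list_features=False, **kwargs):
    """Hypothetical re-creation of the request body shown for OR() and Ranking()."""
    body = {
        # %-formatting is used so the literal braces of the grammar annotation survive.
        "yql": 'select * from sources * where ({grammar: "any"}userInput("%s"));' % query,
        "ranking": {"profile": profile, "listFeatures": str(list_features).lower()},
    }
    # Extra keyword arguments, such as hits, are merged into the body as-is.
    body.update(kwargs)
    return body


request_body = sketch_or_request_body("this is a test", hits=10)
```

Comparing such a hand-built dictionary with the output of `debug_request=True` is a quick way to confirm that a query model sends what you expect.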
Recall documents using the id field:
result = send_query(
    app=app,
    query="this is a test",
    query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
    hits=10,
    recall=("id", [1, 5]),
)
Use a body_function to specify a QueryModel:
def body_function(query):
    body = {
        "yql": "select * from sources * where userQuery();",
        "query": query,
        "type": "any",
        "ranking": {"profile": "bm25", "listFeatures": "true"},
    }
    return body
query_model = QueryModel(body_function=body_function)
result = send_query(
    app=app,
    query="this is a test",
    query_model=query_model,
    hits=10
)
send_query_batch
send_query_batch (app, body_batch:Optional[List[Dict]]=None, query_batch:Optional[List[str]]=None, query_model:Optional[__main__.QueryModel]=None, recall_batch:Optional[List[Tuple]]=None, asynchronous=True, connections:Optional[int]=100, total_timeout:int=100, **kwargs)
Send queries in batch to a Vespa app.
Parameter | Type | Default | Details |
---|---|---|---|
app | | | Connection to a Vespa application. |
body_batch | typing.Optional[typing.List[typing.Dict]] | None | Contains all the request parameters. Set to None if using ‘query_batch’. |
query_batch | typing.Optional[typing.List[str]] | None | Query strings. Set to None if using ‘body_batch’. |
query_model | typing.Optional[main.QueryModel] | None | Query model to use when sending query strings. Set to None if using ‘body_batch’. |
recall_batch | typing.Optional[typing.List[typing.Tuple]] | None | One tuple for each query. Tuple of size 2 where the first element is the name of the field to use to recall and the second element is a list of the values to be recalled. |
asynchronous | bool | True | Set True to send data in async mode. Defaults to True. |
connections | typing.Optional[int] | 100 | Number of allowed concurrent connections, valid only if asynchronous=True. |
total_timeout | int | 100 | Total timeout in seconds for each of the concurrent requests when using asynchronous=True. |
kwargs | | | |
Returns | typing.List[vespa.io.VespaQueryResponse] | | HTTP POST responses. |
Use body_batch to send a batch of body requests.
body_batch = [
    {"yql": "select * from sources * where test"},
    {"yql": "select * from sources * where test2"},
]
result = send_query_batch(app=app, body_batch=body_batch)
Use query_batch to send a batch of query strings to be ranked according to a QueryModel.
result = send_query_batch(
    app=app,
    query_batch=["this is a test", "this is a test 2"],
    query_model=QueryModel(
        match_phase=OR(),
        ranking=Ranking()
    ),
    hits=10,
)
Use recall_batch to send one tuple for each query in query_batch.
result = send_query_batch(
    app=app,
    query_batch=["this is a test", "this is a test 2"],
    query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
    hits=10,
    recall_batch=[("doc_id", [2, 7]), ("doc_id", [0, 5])],
)
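The `connections` and `total_timeout` parameters describe a common asynchronous pattern: many requests in flight at once, capped by a limit, each with its own timeout. The sketch below shows that pattern with `asyncio` and a fake request function. It is an illustration under those assumptions, not the library's implementation; `fake_send` and `send_batch` are hypothetical names.

```python
import asyncio


async def fake_send(query, delay=0.01):
    """Stand-in for one HTTP request; a real client would POST to Vespa here."""
    await asyncio.sleep(delay)
    return {"query": query, "status": 200}


async def send_batch(queries, connections=100, total_timeout=100):
    """Send all queries concurrently, with at most `connections` in flight at once."""
    semaphore = asyncio.Semaphore(connections)

    async def bounded(query):
        async with semaphore:
            # Each request gets its own timeout, in seconds.
            return await asyncio.wait_for(fake_send(query), timeout=total_timeout)

    # gather returns results in the same order as the input queries.
    return await asyncio.gather(*(bounded(query) for query in queries))


responses = asyncio.run(
    send_batch(["this is a test", "this is a test 2"], connections=1)
)
```

With `connections=1` the requests run one after another; raising the limit lets them overlap while still bounding the load placed on the application.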
Collect Vespa features
collect_vespa_features
collect_vespa_features (app:vespa.application.Vespa, labeled_data, id_field:str, query_model:__main__.QueryModel, number_additional_docs:int, fields:List[str], keep_features:Optional[List[str]]=None, relevant_score:int=1, default_score:int=0, **kwargs)
Collect Vespa features based on a set of labelled data.
Parameter | Type | Default | Details |
---|---|---|---|
app | Vespa | | Connection to a Vespa application. |
labeled_data | | | Labelled data containing query, query_id and relevant ids. See examples about data format. |
id_field | str | | The Vespa field representing the document id. |
query_model | QueryModel | | Query model. |
number_additional_docs | int | | Number of additional documents to retrieve for each relevant document. Duplicate documents will be dropped. |
fields | typing.List[str] | | Vespa fields to collect, e.g. [“rankfeatures”, “summaryfeatures”] |
keep_features | typing.Optional[typing.List[str]] | None | List containing the names of the features that should be returned. Defaults to None, which returns all the features contained in the ‘fields’ argument. |
relevant_score | int | 1 | Score to assign to relevant documents. Defaults to 1. |
default_score | int | 0 | Score to assign to the additional documents that are not relevant. Defaults to 0. |
kwargs | | | |
Returns | DataFrame | | DataFrame containing document id (document_id), query id (query_id), scores (relevant) and Vespa rank features returned by the Query model RankProfile used. |
Usage:
Define labeled_data as a list of dict containing relevant documents:
labeled_data = [
    {
        "query_id": 0,
        "query": "give me title 1",
        "relevant_docs": [{"id": "1", "score": 1}],
    },
    {
        "query_id": 1,
        "query": "give me title 3",
        "relevant_docs": [{"id": "3", "score": 1}],
    },
]
Collect Vespa features:
rank_features = collect_vespa_features(
    app=app,
    labeled_data=labeled_data,
    id_field="doc_id",
    query_model=QueryModel(
        match_phase=OR(),
        ranking=Ranking(name="bm25", list_features=True)
    ),
    number_additional_docs=2,
    fields=["rankfeatures"],
)
rank_features
 | document_id | query_id | label | attributeMatch(doc_id) | attributeMatch(doc_id).averageWeight | attributeMatch(doc_id).completeness | attributeMatch(doc_id).fieldCompleteness | attributeMatch(doc_id).importance | attributeMatch(doc_id).matches | attributeMatch(doc_id).maxWeight | ... | term(3).significance | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(text).fieldCoverage | textSimilarity(text).order | textSimilarity(text).proximity | textSimilarity(text).queryCoverage | textSimilarity(text).score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
3 | 7 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
1 | 3 | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
5 | 7 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
4 rows × 94 columns
Use a DataFrame for labeled_data instead of a list of dict:
from pandas import DataFrame
labeled_data = [
    {
        "qid": 0,
        "query": "give me title 1",
        "doc_id": 1,
        "relevance": 1
    },
    {
        "qid": 1,
        "query": "give me title 3",
        "doc_id": 3,
        "relevance": 1
    },
]
labeled_data_df = DataFrame.from_records(labeled_data)
labeled_data_df
 | qid | query | doc_id | relevance |
---|---|---|---|---|
0 | 0 | give me title 1 | 1 | 1 |
1 | 1 | give me title 3 | 3 | 1 |
rank_features = collect_vespa_features(
    app=app,
    labeled_data=labeled_data_df,
    id_field="doc_id",
    query_model=QueryModel(
        match_phase=OR(), ranking=Ranking(name="bm25", list_features=True)
    ),
    number_additional_docs=2,
    fields=["rankfeatures"],
)
rank_features
 | document_id | query_id | label | attributeMatch(doc_id) | attributeMatch(doc_id).averageWeight | attributeMatch(doc_id).completeness | attributeMatch(doc_id).fieldCompleteness | attributeMatch(doc_id).importance | attributeMatch(doc_id).matches | attributeMatch(doc_id).maxWeight | ... | term(3).significance | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(text).fieldCoverage | textSimilarity(text).order | textSimilarity(text).proximity | textSimilarity(text).queryCoverage | textSimilarity(text).score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
3 | 7 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
1 | 3 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
5 | 7 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
4 rows × 94 columns
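The two labeled_data layouts shown above carry the same information: the nested form groups relevant documents under each query, while the flat form has one record per query and document pair. A minimal sketch of converting the first into the second follows. `flatten_labeled_data` is a hypothetical helper, the output keys copy the DataFrame example, and the cast of the id to int is an assumption made to match it.

```python
def flatten_labeled_data(labeled_data):
    """Turn the nested layout into one flat record per (query, relevant document)."""
    records = []
    for item in labeled_data:
        for doc in item["relevant_docs"]:
            records.append(
                {
                    "qid": item["query_id"],
                    "query": item["query"],
                    # Assumption: ids are numeric strings, as in the example above.
                    "doc_id": int(doc["id"]),
                    "relevance": doc["score"],
                }
            )
    return records


nested = [
    {
        "query_id": 0,
        "query": "give me title 1",
        "relevant_docs": [{"id": "1", "score": 1}],
    },
    {
        "query_id": 1,
        "query": "give me title 3",
        "relevant_docs": [{"id": "3", "score": 1}],
    },
]
flat_records = flatten_labeled_data(nested)
```

The resulting list of records has the same shape as the input to `DataFrame.from_records` in the example above.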
Keep only selected features by specifying their names in the keep_features argument:
rank_features = collect_vespa_features(
    app=app,
    labeled_data=labeled_data_df,
    id_field="doc_id",
    query_model=QueryModel(
        match_phase=OR(), ranking=Ranking(name="bm25", list_features=True)
    ),
    number_additional_docs=2,
    fields=["rankfeatures"],
    keep_features=["textSimilarity(text).score"],
)
rank_features
 | document_id | query_id | label | textSimilarity(text).score |
---|---|---|---|---|
0 | 1 | 0 | 0 | 0.750000 |
3 | 7 | 0 | 0 | 0.425781 |
1 | 3 | 1 | 0 | 0.750000 |
5 | 7 | 1 | 0 | 0.425781 |
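The effect of keep_features can be pictured as a column filter that always retains the identifying columns. The sketch below applies that idea to plain dictionaries. `keep_selected` and `ID_COLUMNS` are hypothetical names, and the rows are a hand-copied excerpt of the output above rather than data fetched from an application.

```python
ID_COLUMNS = ("document_id", "query_id", "label")


def keep_selected(rows, keep_features=None):
    """Keep the id columns plus the named features; keep all when keep_features is None."""
    if keep_features is None:
        return [dict(row) for row in rows]
    wanted = ID_COLUMNS + tuple(keep_features)
    return [{key: row[key] for key in wanted if key in row} for row in rows]


rows = [
    {"document_id": 1, "query_id": 0, "label": 0,
     "textSimilarity(text).score": 0.75, "term(3).weight": 100.0},
    {"document_id": 7, "query_id": 0, "label": 0,
     "textSimilarity(text).score": 0.425781, "term(3).weight": 100.0},
]
selected = keep_selected(rows, keep_features=["textSimilarity(text).score"])
```

Passing `keep_features=None` returns every column unchanged, mirroring the default described in the parameter table.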
store_vespa_features
store_vespa_features (app:vespa.application.Vespa, output_file_path:str, labeled_data, id_field:str, query_model:__main__.QueryModel, number_additional_docs:int, fields:List[str], keep_features:Optional[List[str]]=None, relevant_score:int=1, default_score:int=0, batch_size=1000, **kwargs)
Retrieve Vespa rank features and store them in a .csv file.
Parameter | Type | Default | Details |
---|---|---|---|
app | Vespa | | Connection to a Vespa application. |
output_file_path | str | | Path of the .csv output file. It will create the file if it does not exist and append the Vespa features to a pre-existing file. |
labeled_data | | | Labelled data containing query, query_id and relevant ids. See details about data format. |
id_field | str | | The Vespa field representing the document id. |
query_model | QueryModel | | Query model. |
number_additional_docs | int | | Number of additional documents to retrieve for each relevant document. |
fields | typing.List[str] | | List of Vespa fields to collect, e.g. [“rankfeatures”, “summaryfeatures”] |
keep_features | typing.Optional[typing.List[str]] | None | List containing the names of the features that should be returned. Defaults to None, which returns all the features contained in the ‘fields’ argument. |
relevant_score | int | 1 | Score to assign to relevant documents. |
default_score | int | 0 | Score to assign to the additional documents that are not relevant. |
batch_size | int | 1000 | The size of the batch of labeled data points to be processed. |
kwargs | | | |
Returns | int | | Returns 0 upon success. |
Usage:
from pandas import read_csv
labeled_data = [
    {
        "query_id": 0,
        "query": "give me title 1",
        "relevant_docs": [{"id": "1", "score": 1}],
    },
    {
        "query_id": 1,
        "query": "give me title 3",
        "relevant_docs": [{"id": "3", "score": 1}],
    },
]
store_vespa_features(
    app=app,
    output_file_path="vespa_features.csv",
    labeled_data=labeled_data,
    id_field="doc_id",
    query_model=QueryModel(
        match_phase=OR(), ranking=Ranking(name="bm25", list_features=True)
    ),
    number_additional_docs=2,
    fields=["rankfeatures", "summaryfeatures"],
)
rank_features = read_csv("vespa_features.csv")
rank_features
Rows collected: 4.
Batch progress: 1/1.
 | document_id | query_id | label | attributeMatch(doc_id) | attributeMatch(doc_id).averageWeight | attributeMatch(doc_id).completeness | attributeMatch(doc_id).fieldCompleteness | attributeMatch(doc_id).importance | attributeMatch(doc_id).matches | attributeMatch(doc_id).maxWeight | ... | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(text).fieldCoverage | textSimilarity(text).order | textSimilarity(text).proximity | textSimilarity(text).queryCoverage | textSimilarity(text).score | vespa.summaryFeatures.cached |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 | 0.0 |
1 | 7 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 | 0.0 |
2 | 3 | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 | 0.0 |
3 | 7 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 | 0.0 |
4 rows × 95 columns
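The batching and append behaviour described for output_file_path and batch_size can be sketched with the standard csv module. The code below is an illustration of that pattern only: `append_in_batches` is a hypothetical helper, the rows are made up, and the real function also queries Vespa for each batch, which is omitted here.

```python
import csv
import os
import tempfile


def append_in_batches(rows, output_file_path, batch_size=1000):
    """Append rows to a .csv file batch by batch, writing the header only for a new file."""
    if not rows:
        return 0
    fieldnames = list(rows[0].keys())
    number_of_batches = -(-len(rows) // batch_size)  # ceiling division
    for batch_index in range(number_of_batches):
        batch = rows[batch_index * batch_size:(batch_index + 1) * batch_size]
        # A pre-existing file already has a header, so only new files get one.
        write_header = not os.path.exists(output_file_path)
        with open(output_file_path, "a", newline="") as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            if write_header:
                writer.writeheader()
            writer.writerows(batch)
    return 0


# Made-up rows standing in for collected features.
rows = [{"document_id": i, "query_id": 0, "label": 0} for i in range(5)]
output_file_path = os.path.join(tempfile.mkdtemp(), "features_sketch.csv")
status = append_in_batches(rows, output_file_path, batch_size=2)

with open(output_file_path, newline="") as csv_file:
    stored = list(csv.DictReader(csv_file))
```

Writing one batch at a time keeps memory use bounded and means that an interrupted run still leaves the completed batches on disk.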