query
Match Filters
MatchFilter
MatchFilter ()
Abstract class for match filters.
AND
AND ()
Filter that matches documents containing all the query terms.
Usage: The AND filter is usually used when specifying query models.
and_filter = AND()
OR
OR ()
Filter that matches any document containing at least one query term.
Usage: The OR filter is usually used when specifying query models.
or_filter = OR()
WeakAnd
WeakAnd (hits:int, field:str='default')
Match documents according to the weakAND algorithm.
Reference: https://docs.vespa.ai/en/using-wand-with-vespa.html
| | Type | Default | Details |
|---|---|---|---|
| hits | int | | Lower bound on the number of hits to be retrieved. |
| field | str | default | Which Vespa field to search. |
| Returns | None | | |
Usage: The WeakAnd filter is usually used when specifying query models.
weakand_filter = WeakAnd(hits=10, field="default")
Tokenize
Tokenize (hits:int, field:str='default')
Match documents according to the weakAND algorithm without parsing special characters.
Reference: https://docs.vespa.ai/en/reference/simple-query-language-reference.html
| | Type | Default | Details |
|---|---|---|---|
| hits | int | | Lower bound on the number of hits to be retrieved. |
| field | str | default | Which Vespa field to search. |
| Returns | None | | |
Usage: The Tokenize filter is usually used when specifying query models.
tokenize_filter = Tokenize(hits=10, field="default")
ANN
ANN (doc_vector:str, query_vector:str, hits:int, label:str, approximate:bool=True)
Match documents according to the nearest neighbor operator.
Reference: https://docs.vespa.ai/en/reference/query-language-reference.html
| | Type | Default | Details |
|---|---|---|---|
| doc_vector | str | | Name of the document field to be used in the distance calculation. |
| query_vector | str | | Name of the query field to be used in the distance calculation. |
| hits | int | | Lower bound on the number of hits to return. |
| label | str | | A label to identify this specific operator instance. |
| approximate | bool | True | True to use approximate nearest neighbor, False to use brute force. |
| Returns | None | | |
Usage: The ANN filter is usually used when specifying query models.
By default, the ANN operator uses approximate nearest neighbor:
match_filter = ANN(
doc_vector="doc_vector",
query_vector="query_vector",
hits=10,
label="label",
)
Brute-force can be used by specifying approximate=False:
ann_filter = ANN(
doc_vector="doc_vector",
query_vector="query_vector",
hits=10,
label="label",
approximate=False,
)
Union
Union (*args:__main__.MatchFilter)
Match documents that belong to the union of many match filters.
| | Type | Details |
|---|---|---|
| args | MatchFilter | Match filters whose union is taken. |
| Returns | None | |
Usage: The Union filter is usually used when specifying query models.
union_filter = Union(
WeakAnd(hits=10, field="field_name"),
ANN(
doc_vector="doc_vector",
query_vector="query_vector",
hits=10,
label="label",
),
)
Ranking
Ranking
Ranking (name:str='default', list_features:bool=False)
Define the rank profile to be used during ranking.
| | Type | Default | Details |
|---|---|---|---|
| name | str | default | Name of the rank profile as defined in a Vespa search definition. |
| list_features | bool | False | Whether the ranking features should be returned. |
| Returns | None | | |
Usage: Ranking is usually used when specifying query models.
ranking = Ranking(name="bm25", list_features=True)
Query properties
QueryProperty
QueryProperty ()
Abstract class for query property.
QueryRankingFeature
QueryRankingFeature (name:str, mapping:Callable[[str],List[float]])
Include ranking.feature.query into a Vespa query.
| | Type | Details |
|---|---|---|
| name | str | Name of the feature. |
| mapping | typing.Callable[[str], typing.List[float]] | Function mapping a string to a list of floats. |
| Returns | None | |
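The mapping argument can be any callable that turns a query string into a list of floats. As a minimal sketch, assuming a toy hashing scheme in place of a real embedding model (the `toy_embedding` helper and its `dim` parameter are hypothetical and not part of the library), such a callable might look like this:

```python
import hashlib
from typing import List


def toy_embedding(text: str, dim: int = 4) -> List[float]:
    """Hypothetical mapping: count tokens hashed into `dim` buckets.

    This stands in for a real embedding model and only illustrates the
    Callable[[str], List[float]] shape that `mapping` expects.
    """
    vector = [0.0] * dim
    for token in text.lower().split():
        # Use the first byte of a stable hash to choose a bucket.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    return vector
```

Under these assumptions it could be passed as `mapping=toy_embedding` when constructing a QueryRankingFeature.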
Usage: QueryRankingFeature is usually used when specifying query models.
query_property = QueryRankingFeature(
name="query_vector", mapping=lambda x: [1, 2, 3]
)
Query model
QueryModel
QueryModel (name:str='default_name', query_properties:Optional[List[QueryProperty]]=None, match_phase:MatchFilter=AND(), ranking:Ranking=Ranking(), body_function:Optional[Callable[[str],Dict]]=None)
Define a query model.
A QueryModel is an abstraction that encapsulates all the relevant information controlling how a Vespa app matches and ranks documents.
| | Type | Default | Details |
|---|---|---|---|
| name | str | default_name | Name of the query model. Used to tag model-related quantities, like evaluation metrics. |
| query_properties | typing.Optional[typing.List[QueryProperty]] | None | Query properties to be included in the queries. |
| match_phase | MatchFilter | AND() | Define the match criteria. |
| ranking | Ranking | Ranking() | Define the rank criteria. |
| body_function | typing.Optional[typing.Callable[[str], typing.Dict]] | None | Function that takes the query as a parameter and returns the body of a Vespa query. |
| Returns | None | | |
Usage:
Specify a query model with default configurations:
query_model = QueryModel()
Specify match phase, ranking phase and properties used by them.
query_model = QueryModel(
query_properties=[
QueryRankingFeature(name="query_embedding", mapping=lambda x: [1, 2, 3])
],
match_phase=ANN(
doc_vector="document_embedding",
query_vector="query_embedding",
hits=10,
label="label",
),
ranking=Ranking(name="bm25_plus_embeddings", list_features=True),
)
Specify a query model based on a function that outputs Vespa YQL.
def body_function(query):
body = {
"yql": "select * from sources * where userQuery();",
"query": query,
"type": "any",
"ranking": {"profile": "bm25", "listFeatures": "true"},
}
return body
query_model = QueryModel(body_function=body_function)
Send query with QueryModel
send_query
send_query (app:vespa.application.Vespa, body:Optional[Dict]=None, query:Optional[str]=None, query_model:Optional[__main__.QueryModel]=None, debug_request:bool=False, recall:Optional[Tuple]=None, **kwargs)
Send a query request to a Vespa application.
Either send ‘body’ containing all the request parameters or specify ‘query’ and ‘query_model’.
| | Type | Default | Details |
|---|---|---|---|
| app | Vespa | | Connection to a Vespa application |
| body | typing.Optional[typing.Dict] | None | Contains all the request parameters. None when using query_model. |
| query | typing.Optional[str] | None | Query string. None when using body. |
| query_model | typing.Optional[QueryModel] | None | Query model. None when using body. |
| debug_request | bool | False | Return request body for debugging instead of sending the request. |
| recall | typing.Optional[typing.Tuple] | None | Tuple of size 2 where the first element is the name of the field to use to recall and the second element is a list of the values to be recalled. |
| kwargs | | | |
| Returns | VespaQueryResponse | | Either the request body if debug_request is True or the result from the Vespa application. |
Usage: Assume app is a Vespa connection.
Send request body.
body = {"yql": "select * from sources * where test"}
result = send_query(app=app, body=body)
Use query and query_model:
result = send_query(
app=app,
query="this is a test",
query_model=QueryModel(
match_phase=OR(),
ranking=Ranking()
),
hits=10,
)
Debug the output of the QueryModel by setting debug_request=True:
send_query(
app=app,
query="this is a test",
query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
debug_request=True,
hits=10,
).request_body
{'yql': 'select * from sources * where ({grammar: "any"}userInput("this is a test"));',
'ranking': {'profile': 'default', 'listFeatures': 'false'},
'hits': 10}
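To make the printed request body easier to read, here is a minimal sketch of how such a body could be assembled for the OR filter. The `build_or_body` helper is hypothetical and is not the library's implementation; it only reproduces the structure of the output shown above, and it assumes the query string contains no double quotes that would need escaping.

```python
from typing import Any, Dict


def build_or_body(
    query: str,
    profile: str = "default",
    list_features: bool = False,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the request body printed above."""
    # The OR filter shows up as the "any" grammar annotation on userInput.
    yql = (
        'select * from sources * where ({grammar: "any"}userInput("%s"));'
        % query
    )
    body: Dict[str, Any] = {
        "yql": yql,
        # The boolean is serialized as a lowercase string, as in the output.
        "ranking": {
            "profile": profile,
            "listFeatures": str(list_features).lower(),
        },
    }
    # Extra keyword arguments (for example hits=10) become top-level keys.
    body.update(kwargs)
    return body
```

Calling `build_or_body("this is a test", hits=10)` yields a dictionary with the same keys as the debug output.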
Recall documents using the id field:
result = send_query(
app=app,
query="this is a test",
query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
hits=10,
recall=("id", [1, 5]),
)
Use a body_function to specify a QueryModel:
def body_function(query):
body = {
"yql": "select * from sources * where userQuery();",
"query": query,
"type": "any",
"ranking": {"profile": "bm25", "listFeatures": "true"},
}
return body
query_model = QueryModel(body_function=body_function)
result = send_query(
app=app,
query="this is a test",
query_model=query_model,
hits=10
)
send_query_batch
send_query_batch (app, body_batch:Optional[List[Dict]]=None, query_batch:Optional[List[str]]=None, query_model:Optional[__main__.QueryModel]=None, recall_batch:Optional[List[Tuple]]=None, asynchronous=True, connections:Optional[int]=100, total_timeout:int=100, **kwargs)
Send queries in batch to a Vespa app.
| | Type | Default | Details |
|---|---|---|---|
| app | | | Connection to a Vespa application |
| body_batch | typing.Optional[typing.List[typing.Dict]] | None | Contains all the request parameters. Set to None if using 'query_batch'. |
| query_batch | typing.Optional[typing.List[str]] | None | Query strings. Set to None if using 'body_batch'. |
| query_model | typing.Optional[QueryModel] | None | Query model to use when sending query strings. Set to None if using 'body_batch'. |
| recall_batch | typing.Optional[typing.List[typing.Tuple]] | None | One tuple for each query. Tuple of size 2 where the first element is the name of the field to use to recall and the second element is a list of the values to be recalled. |
| asynchronous | bool | True | Set True to send data in async mode. Default to True. |
| connections | typing.Optional[int] | 100 | Number of allowed concurrent connections, valid only if asynchronous=True. |
| total_timeout | int | 100 | Total timeout in secs for each of the concurrent requests when using asynchronous=True. |
| kwargs | | | |
| Returns | typing.List[vespa.io.VespaQueryResponse] | | HTTP POST responses. |
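The connections and total_timeout parameters describe a bounded-concurrency pattern. The sketch below illustrates that pattern with plain asyncio; it is not the library's implementation. The `run_batch` and `fake_send` names are hypothetical, and `fake_send` stands in for a real HTTP request.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_batch(
    bodies: List[dict],
    send_one: Callable[[dict], Awaitable[dict]],
    connections: int = 100,
    total_timeout: float = 100,
) -> List[dict]:
    """Send all bodies, keeping at most `connections` requests in flight."""
    semaphore = asyncio.Semaphore(connections)

    async def guarded(body: dict) -> dict:
        async with semaphore:
            # Each request gets its own timeout, in seconds.
            return await asyncio.wait_for(send_one(body), timeout=total_timeout)

    # gather returns the results in the same order as the input bodies.
    return await asyncio.gather(*(guarded(body) for body in bodies))


async def fake_send(body: dict) -> dict:
    """Hypothetical stand-in for a network call: echoes the YQL back."""
    await asyncio.sleep(0)
    return {"echo": body["yql"]}


results = asyncio.run(
    run_batch([{"yql": "a"}, {"yql": "b"}], fake_send, connections=1)
)
```

A semaphore keeps the example short; a real client would more likely rely on a connection pool limit in the HTTP library it uses.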
Use body_batch to send a batch of body requests.
body_batch = [
{"yql": "select * from sources * where test"},
{"yql": "select * from sources * where test2"}
]
result = send_query_batch(app=app, body_batch=body_batch)
Use query_batch to send a batch of query strings to be ranked according to a QueryModel.
result = send_query_batch(
app=app,
query_batch=["this is a test", "this is a test 2"],
query_model=QueryModel(
match_phase=OR(),
ranking=Ranking()
),
hits=10,
)
Use recall_batch to send one tuple for each query in query_batch.
result = send_query_batch(
app=app,
query_batch=["this is a test", "this is a test 2"],
query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
hits=10,
recall_batch=[("doc_id", [2, 7]), ("doc_id", [0, 5])],
)
Collect Vespa features
collect_vespa_features
collect_vespa_features (app:vespa.application.Vespa, labeled_data, id_field:str, query_model:__main__.QueryModel, number_additional_docs:int, fields:List[str], keep_features:Optional[List[str]]=None, relevant_score:int=1, default_score:int=0, **kwargs)
Collect Vespa features based on a set of labelled data.
| | Type | Default | Details |
|---|---|---|---|
| app | Vespa | | Connection to a Vespa application. |
| labeled_data | | | Labelled data containing query, query_id and relevant ids. See examples about data format. |
| id_field | str | | The Vespa field representing the document id. |
| query_model | QueryModel | | Query model. |
| number_additional_docs | int | | Number of additional documents to retrieve for each relevant document. Duplicate documents will be dropped. |
| fields | typing.List[str] | | Vespa fields to collect, e.g. ["rankfeatures", "summaryfeatures"] |
| keep_features | typing.Optional[typing.List[str]] | None | List containing the names of the features that should be returned. Default to None, which return all the features contained in the 'fields' argument. |
| relevant_score | int | 1 | Score to assign to relevant documents. Default to 1. |
| default_score | int | 0 | Score to assign to the additional documents that are not relevant. Default to 0. |
| kwargs | | | |
| Returns | DataFrame | | DataFrame containing document id (document_id), query id (query_id), scores (relevant) and vespa rank features returned by the Query model RankProfile used. |
Usage:
Define labeled_data as a list of dict containing relevant documents:
labeled_data = [
{
"query_id": 0,
"query": "give me title 1",
"relevant_docs": [{"id": "1", "score": 1}],
},
{
"query_id": 1,
"query": "give me title 3",
"relevant_docs": [{"id": "3", "score": 1}],
},
]
Collect Vespa features:
rank_features = collect_vespa_features(
app=app,
labeled_data=labeled_data,
id_field="doc_id",
query_model=QueryModel(
match_phase=OR(),
ranking=Ranking(name="bm25", list_features=True)
),
number_additional_docs=2,
fields=["rankfeatures"],
)
rank_features
| | document_id | query_id | label | attributeMatch(doc_id) | attributeMatch(doc_id).averageWeight | attributeMatch(doc_id).completeness | attributeMatch(doc_id).fieldCompleteness | attributeMatch(doc_id).importance | attributeMatch(doc_id).matches | attributeMatch(doc_id).maxWeight | ... | term(3).significance | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(text).fieldCoverage | textSimilarity(text).order | textSimilarity(text).proximity | textSimilarity(text).queryCoverage | textSimilarity(text).score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
| 3 | 7 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
| 1 | 3 | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
| 5 | 7 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
4 rows × 94 columns
Use a DataFrame for labeled_data instead of a list of dict:
from pandas import DataFrame  # assumed import; the original snippet omits it

labeled_data = [
{
"qid": 0,
"query": "give me title 1",
"doc_id": 1,
"relevance": 1
},
{
"qid": 1,
"query": "give me title 3",
"doc_id": 3,
"relevance": 1
},
]
labeled_data_df = DataFrame.from_records(labeled_data)
labeled_data_df
| | qid | query | doc_id | relevance |
|---|---|---|---|---|
| 0 | 0 | give me title 1 | 1 | 1 |
| 1 | 1 | give me title 3 | 3 | 1 |
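The list-of-dict format and the flat record format above carry the same information. As a minimal sketch, a hypothetical `flatten_labeled_data` helper (not part of the library) could convert the first into the second. Document ids are copied as they appear, so string ids stay strings.

```python
from typing import Dict, List


def flatten_labeled_data(labeled_data: List[Dict]) -> List[Dict]:
    """Hypothetical helper: emit one flat record per (query, relevant doc)."""
    records = []
    for item in labeled_data:
        for doc in item["relevant_docs"]:
            records.append(
                {
                    "qid": item["query_id"],
                    "query": item["query"],
                    "doc_id": doc["id"],
                    "relevance": doc["score"],
                }
            )
    return records
```

The resulting list of records has the column names used in the DataFrame example, so it could be handed to `DataFrame.from_records`.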
rank_features = collect_vespa_features(
app=app,
labeled_data=labeled_data_df,
id_field="doc_id",
query_model=QueryModel(
match_phase=OR(), ranking=Ranking(name="bm25", list_features=True)
),
number_additional_docs=2,
fields=["rankfeatures"],
)
rank_features
| | document_id | query_id | label | attributeMatch(doc_id) | attributeMatch(doc_id).averageWeight | attributeMatch(doc_id).completeness | attributeMatch(doc_id).fieldCompleteness | attributeMatch(doc_id).importance | attributeMatch(doc_id).matches | attributeMatch(doc_id).maxWeight | ... | term(3).significance | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(text).fieldCoverage | textSimilarity(text).order | textSimilarity(text).proximity | textSimilarity(text).queryCoverage | textSimilarity(text).score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
| 3 | 7 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
| 1 | 3 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 |
| 5 | 7 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.583333 | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 |
4 rows × 94 columns
Keep only selected features by specifying their names in the keep_features argument:
rank_features = collect_vespa_features(
app=app,
labeled_data=labeled_data_df,
id_field="doc_id",
query_model=QueryModel(
match_phase=OR(), ranking=Ranking(name="bm25", list_features=True)
),
number_additional_docs=2,
fields=["rankfeatures"],
keep_features=["textSimilarity(text).score"],
)
rank_features
| | document_id | query_id | label | textSimilarity(text).score |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0.750000 |
| 3 | 7 | 0 | 0 | 0.425781 |
| 1 | 3 | 1 | 0 | 0.750000 |
| 5 | 7 | 1 | 0 | 0.425781 |
store_vespa_features
store_vespa_features (app:vespa.application.Vespa, output_file_path:str, labeled_data, id_field:str, query_model:__main__.QueryModel, number_additional_docs:int, fields:List[str], keep_features:Optional[List[str]]=None, relevant_score:int=1, default_score:int=0, batch_size=1000, **kwargs)
Retrieve Vespa rank features and store them in a .csv file.
| | Type | Default | Details |
|---|---|---|---|
| app | Vespa | | Connection to a Vespa application. |
| output_file_path | str | | Path of the .csv output file. The file is created if it does not exist; otherwise the Vespa features are appended to the pre-existing file. |
| labeled_data | | | Labelled data containing query, query_id and relevant ids. See details about data format. |
| id_field | str | | The Vespa field representing the document id. |
| query_model | QueryModel | | Query model. |
| number_additional_docs | int | | Number of additional documents to retrieve for each relevant document. |
| fields | typing.List[str] | | List of Vespa fields to collect, e.g. ["rankfeatures", "summaryfeatures"] |
| keep_features | typing.Optional[typing.List[str]] | None | List containing the names of the features that should be returned. Defaults to None, which returns all the features contained in the 'fields' argument. |
| relevant_score | int | 1 | Score to assign to relevant documents. |
| default_score | int | 0 | Score to assign to the additional documents that are not relevant. |
| batch_size | int | 1000 | The size of the batch of labeled data points to be processed. |
| kwargs | | | |
| Returns | int | | Returns 0 upon success. |
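The batch_size and output_file_path descriptions imply a process-in-batches, append-to-file pattern. The sketch below shows that pattern with the standard csv module; it is not the library's implementation, and the `batches` and `append_rows` helpers are hypothetical.

```python
import csv
import os
from typing import Dict, Iterable, List


def batches(items: List[Dict], batch_size: int) -> Iterable[List[Dict]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def append_rows(path: str, rows: List[Dict]) -> None:
    """Append rows to a CSV file, writing the header only for a new file.

    Assumes every row has the same keys as the first row.
    """
    if not rows:
        return
    # A missing or empty file still needs a header line.
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Writing after each batch keeps memory use bounded and leaves partial results on disk if a later batch fails.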
Usage:
labeled_data = [
{
"query_id": 0,
"query": "give me title 1",
"relevant_docs": [{"id": "1", "score": 1}],
},
{
"query_id": 1,
"query": "give me title 3",
"relevant_docs": [{"id": "3", "score": 1}],
},
]
store_vespa_features(
app=app,
output_file_path="vespa_features.csv",
labeled_data=labeled_data,
id_field="doc_id",
query_model=QueryModel(
match_phase=OR(), ranking=Ranking(name="bm25", list_features=True)
),
number_additional_docs=2,
fields=["rankfeatures", "summaryfeatures"],
)
from pandas import read_csv  # assumed import; the original snippet omits it
rank_features = read_csv("vespa_features.csv")
rank_features
Rows collected: 4.
Batch progress: 1/1.
| | document_id | query_id | label | attributeMatch(doc_id) | attributeMatch(doc_id).averageWeight | attributeMatch(doc_id).completeness | attributeMatch(doc_id).fieldCompleteness | attributeMatch(doc_id).importance | attributeMatch(doc_id).matches | attributeMatch(doc_id).maxWeight | ... | term(3).weight | term(4).connectedness | term(4).significance | term(4).weight | textSimilarity(text).fieldCoverage | textSimilarity(text).order | textSimilarity(text).proximity | textSimilarity(text).queryCoverage | textSimilarity(text).score | vespa.summaryFeatures.cached |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 | 0.0 |
| 1 | 7 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 | 0.0 |
| 2 | 3 | 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.50 | 1.0 | 1.000000 | 0.50 | 0.750000 | 0.0 |
| 3 | 7 | 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 100.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.859375 | 0.25 | 0.425781 | 0.0 |
4 rows × 95 columns