Evaluation

vespa.evaluation

Vespa evaluation module.

This module provides tools for evaluating and benchmarking Vespa applications.

Vespa(url, port=None, deployment_message=None, cert=None, key=None, vespa_cloud_secret_token=None, output_file=sys.stdout, application_package=None, additional_headers=None)

Bases: object

Establish a connection with an existing Vespa application.

Parameters:

Name Type Description Default
url str

Vespa endpoint URL.

required
port int

Vespa endpoint port.

None
deployment_message str

Message returned by Vespa engine after deployment. Used internally by deploy methods.

None
cert str

Path to data plane certificate and key file in case the 'key' parameter is None. If 'key' is not None, this should be the path of the certificate file. Typically generated by Vespa-cli with 'vespa auth cert'.

None
key str

Path to the data plane key file. Typically generated by Vespa-cli with 'vespa auth cert'.

None
vespa_cloud_secret_token str

Vespa Cloud data plane secret token.

None
output_file str

Output file to write output messages.

stdout
application_package str

Application package definition used to deploy the application.

None
additional_headers dict

Additional headers to be sent to the Vespa application.

None
Example usage
Vespa(url="https://cord19.vespa.ai")   # doctest: +SKIP

Vespa(url="http://localhost", port=8080)
# repr of the object above: Vespa(http://localhost, 8080)

Vespa(url="https://token-endpoint..z.vespa-app.cloud", vespa_cloud_secret_token="your_token")  # doctest: +SKIP

Vespa(url="https://mtls-endpoint..z.vespa-app.cloud", cert="/path/to/cert.pem", key="/path/to/key.pem")  # doctest: +SKIP

Vespa(url="https://mtls-endpoint..z.vespa-app.cloud", cert="/path/to/cert.pem", key="/path/to/key.pem", additional_headers={"X-Custom-Header": "test"})  # doctest: +SKIP

application_package property

Get application package definition, if available.

asyncio(connections=1, total_timeout=None, timeout=httpx.Timeout(5.0, read=30.0), client=None, **kwargs)

Access Vespa asynchronous connection layer. Should be used as a context manager.

Example usage
async with app.asyncio() as async_app:
    response = await async_app.query(body=body)

# passing kwargs
limits = httpx.Limits(max_keepalive_connections=5, max_connections=5, keepalive_expiry=15)
timeout = httpx.Timeout(connect=3, read=4, write=2, pool=5)
async with app.asyncio(connections=5, timeout=timeout, limits=limits) as async_app:
    response = await async_app.query(body=body)

See VespaAsync for more details on the parameters.

Parameters:

Name Type Description Default
connections int

Maximum number of keep-alive connections.

1
total_timeout int

Deprecated. Will be ignored. Use timeout instead.

None
timeout Timeout

httpx.Timeout object. See Timeouts. Defaults to 5 seconds for connect/write/pool and 30 seconds for read.

Timeout(5.0, read=30.0)
client AsyncClient

Reusable httpx.AsyncClient to use instead of creating a new one. When provided, the caller is responsible for closing the client.

None
**kwargs dict

Additional arguments to be passed to the httpx.AsyncClient.

{}

Returns:

Name Type Description
VespaAsync VespaAsync

Instance of Vespa asynchronous layer.

get_async_session(connections=1, total_timeout=None, timeout=httpx.Timeout(5.0, read=30.0), **kwargs)

Return a configured httpx.AsyncClient for reuse.

The client is created with the same configuration as VespaAsync and is HTTP/2 enabled by default. Callers are responsible for closing the client via await client.aclose() when finished.

Parameters:

Name Type Description Default
connections int

Number of logical connections to keep alive.

1
timeout Timeout | int

Timeout configuration for the client.

Timeout(5.0, read=30.0)
**kwargs

Additional keyword arguments forwarded to httpx.AsyncClient.

{}

Returns:

Type Description
AsyncClient

httpx.AsyncClient: Configured asynchronous HTTP client.

syncio(connections=8, compress='auto', session=None)

Access Vespa synchronous connection layer. Should be used as a context manager.

Example usage:

```python
with app.syncio() as sync_app:
    response = sync_app.query(body=body)
```

See VespaSync for more details.

Parameters:

Name Type Description Default
connections int

Number of allowed concurrent connections.

8
total_timeout float

Total timeout in seconds.

required
compress Union[str, bool]

Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes.

'auto'
session Session

Reusable requests session to utilise for all requests made within the context manager. When provided, the caller is responsible for closing the session.

None

Returns:

Name Type Description
VespaSync VespaSync

Instance of Vespa synchronous layer.

get_sync_session(connections=8, compress='auto')

Return a configured requests.Session for reuse.

The returned session is configured with the same headers, authentication, and connection pooling behaviour as the VespaSync context manager. Callers are responsible for closing the session when it is no longer needed.

Parameters:

Name Type Description Default
connections int

Number of allowed concurrent connections.

8
compress Union[str, bool]

Whether to compress request bodies.

'auto'

Returns:

Name Type Description
Session Session

Configured requests session using CustomHTTPAdapter pooling.
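The `compress` parameter accepts `"auto"`, `True`, or `False`, and `"auto"` compresses bodies larger than 1024 bytes. A minimal sketch of that decision follows. The function name `maybe_compress`, the use of gzip, and the `Content-Encoding` header are assumptions for illustration, not pyvespa's actual code.

```python
import gzip
import json
from typing import Dict, Tuple, Union

COMPRESS_THRESHOLD_BYTES = 1024  # threshold stated in the parameter description


def maybe_compress(body: dict, compress: Union[str, bool] = "auto") -> Tuple[bytes, Dict[str, str]]:
    """Return (payload, extra_headers) for a JSON request body."""
    raw = json.dumps(body).encode("utf-8")
    # True always compresses; "auto" compresses only above the threshold.
    if compress is True or (compress == "auto" and len(raw) > COMPRESS_THRESHOLD_BYTES):
        # Assumption: gzip is the encoding used on the wire.
        return gzip.compress(raw), {"Content-Encoding": "gzip"}
    return raw, {}
```

Small bodies are sent uncompressed under `"auto"`, since compressing them rarely saves enough bytes to pay for the CPU cost.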

wait_for_application_up(max_wait=300)

Wait for application endpoint ready (/ApplicationStatus).

Parameters:

Name Type Description Default
max_wait int

Seconds to wait for the application endpoint.

300

Raises:

Type Description
RuntimeError

If not able to reach endpoint within max_wait or the client fails to authenticate.

Returns:

Type Description
None

None
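The waiting behaviour described above (poll until the endpoint answers, raise `RuntimeError` after `max_wait` seconds) can be sketched as a generic polling loop. The helper `wait_until_up`, its `is_up` callable, and the `poll_interval` parameter are hypothetical; this is an illustration of the pattern rather than the library's implementation.

```python
import time
from typing import Callable


def wait_until_up(is_up: Callable[[], bool], max_wait: float = 300, poll_interval: float = 1.0) -> None:
    """Poll `is_up` until it returns True or `max_wait` seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if is_up():
            return
        if time.monotonic() >= deadline:
            raise RuntimeError(f"Application did not respond within {max_wait} seconds.")
        time.sleep(poll_interval)
```

In practice `is_up` would wrap a status check such as `get_application_status()` and return whether a successful response came back.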

get_application_status()

Get application status (/ApplicationStatus).

Returns:

Type Description
Optional[Response]

None

get_model_endpoint(model_id=None)

Get stateless model evaluation endpoints.

query(body=None, groupname=None, streaming=False, profile=False, **kwargs)

Send a query request to the Vespa application.

Send 'body' containing all the request parameters.

Parameters:

Name Type Description Default
body dict

Dictionary containing request parameters.

None
groupname str

The groupname used with streaming search.

None
streaming bool

Whether to use streaming mode (SSE). Defaults to False.

False
profile bool

Add profiling parameters to the query (response may be large). Defaults to False.

False
**kwargs dict

Extra Vespa Query API parameters.

{}

Returns:

Type Description
Union[VespaQueryResponse, Generator[str, None, None]]

VespaQueryResponse when streaming=False, or a generator of decoded lines when streaming=True.

feed_data_point(schema, data_id, fields, namespace=None, groupname=None, compress='auto', **kwargs)

Feed a data point to a Vespa app. Will create a new VespaSync with connection overhead.

Example usage
app = Vespa(url="localhost", port=8080)
data_id = "1"
fields = {
    "field1": "value1",
}
with VespaSync(app) as sync_app:
    response = sync_app.feed_data_point(
        schema="schema_name",
        data_id=data_id,
        fields=fields
    )
print(response)

Parameters:

Name Type Description Default
schema str

The schema that we are sending data to.

required
data_id str

Unique id associated with this data point.

required
fields dict

Dictionary containing all the fields required by the schema.

required
namespace str

The namespace that we are sending data to.

None
groupname str

The groupname that we are sending data to.

None
compress Union[str, bool]

Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes.

'auto'

Returns:

Name Type Description
VespaResponse VespaResponse

The response of the HTTP POST request.

feed_iterable(iter, schema=None, namespace=None, callback=None, operation_type='feed', max_queue_size=1000, max_workers=8, max_connections=16, compress='auto', **kwargs)

Feed data from an Iterable of Dict with the keys 'id' and 'fields' to be used in the feed_data_point function.

Uses a queue to feed data in parallel with a thread pool. The result of each operation is forwarded to the user-provided callback function that can process the returned VespaResponse.

Example usage
app = Vespa(url="localhost", port=8080)
data = [
    {"id": "1", "fields": {"field1": "value1"}},
    {"id": "2", "fields": {"field1": "value2"}},
]
def callback(response, id):
    print(f"Response for id {id}: {response.status_code}")
app.feed_iterable(data, schema="schema_name", callback=callback)

Parameters:

Name Type Description Default
iter Iterable[dict]

An iterable of Dict containing the keys 'id' and 'fields' to be used in the feed_data_point. Note that this 'id' is only the last part of the full document id, which will be generated automatically by pyvespa.

required
schema str

The Vespa schema name that we are sending data to.

None
namespace str

The Vespa document id namespace. If no namespace is provided, the schema is used.

None
callback function

A callback function to be called on each result. Signature callback(response: VespaResponse, id: str).

None
operation_type str

The operation to perform. Defaults to feed. Valid values are feed, update, or delete.

'feed'
max_queue_size int

The maximum size of the blocking queue and max in-flight operations.

1000
max_workers int

The maximum number of workers in the threadpool executor.

8
max_connections int

The maximum number of persisted connections to the Vespa endpoint.

16
compress Union[str, bool]

Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes.

'auto'
**kwargs dict

Additional parameters passed to the respective operation type specific function (_data_point).

{}

Returns:

Type Description

None
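To illustrate the pattern described for `feed_iterable` (a bounded queue, a thread pool, and a per-result callback), here is a self-contained sketch. The `send` function is a hypothetical stand-in for the real feed operation and returns a status code instead of a `VespaResponse`; the semaphore-based back-pressure is one possible way to bound in-flight work, not necessarily the one pyvespa uses.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def feed_in_parallel(
    docs: Iterable[Dict],
    send: Callable[[Dict], int],           # hypothetical: sends one doc, returns a status code
    callback: Callable[[int, str], None],  # called with (result, doc id) for every operation
    max_workers: int = 8,
    max_queue_size: int = 1000,
) -> None:
    # The semaphore caps queued plus in-flight operations, so a large
    # iterable is consumed lazily instead of being loaded into memory.
    slots = threading.BoundedSemaphore(max_queue_size)

    def work(doc: Dict) -> None:
        try:
            callback(send(doc), doc["id"])
        finally:
            slots.release()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for doc in docs:
            slots.acquire()  # blocks the producer while the queue is full
            pool.submit(work, doc)
    # Leaving the `with` block waits for all submitted operations to finish.
```

Because the callback runs on worker threads, any shared state it touches should be protected by a lock.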

feed_async_iterable(iter, schema=None, namespace=None, callback=None, operation_type='feed', max_queue_size=1000, max_workers=64, max_connections=1, **kwargs)

Feed data asynchronously using httpx.AsyncClient with HTTP/2. Feed from an Iterable of Dict with the keys 'id' and 'fields' to be used in the feed_data_point function. The result of each operation is forwarded to the user-provided callback function that can process the returned VespaResponse. Prefer using this method over feed_iterable when the operation is I/O bound from the client side.

Example usage
app = Vespa(url="localhost", port=8080)
data = [
    {"id": "1", "fields": {"field1": "value1"}},
    {"id": "2", "fields": {"field1": "value2"}},
]
def callback(response, id):
    print(f"Response for id {id}: {response.status_code}")
app.feed_async_iterable(data, schema="schema_name", callback=callback)

Parameters:

Name Type Description Default
iter Iterable[dict]

An iterable of Dict containing the keys 'id' and 'fields' to be used in the feed_data_point. Note that this 'id' is only the last part of the full document id, which will be generated automatically by pyvespa.

required
schema str

The Vespa schema name that we are sending data to.

None
namespace str

The Vespa document id namespace. If no namespace is provided, the schema is used.

None
callback function

A callback function to be called on each result. Signature callback(response: VespaResponse, id: str).

None
operation_type str

The operation to perform. Defaults to feed. Valid values are feed, update, or delete.

'feed'
max_queue_size int

The maximum number of tasks waiting to be processed. Useful to limit memory usage. Default is 1000.

1000
max_workers int

Maximum number of concurrent requests to have in-flight, bound by an asyncio.Semaphore, that needs to be acquired by a submit task. Increase if the server is scaled to handle more requests.

64
max_connections int

The maximum number of connections passed to httpx.AsyncClient to the Vespa endpoint. As HTTP/2 is used, only one connection is needed.

1
**kwargs dict

Additional parameters passed to the respective operation type-specific function (_data_point).

{}

Returns:

Type Description

None
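The role of `max_workers` here, a cap on in-flight requests enforced by an `asyncio.Semaphore`, can be sketched without any network code. The `send` coroutine below is a hypothetical stand-in for the HTTP call, and the sketch omits the `max_queue_size` bound for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List


async def feed_bounded(
    docs: Iterable[Dict],
    send: Callable[[Dict], Awaitable[int]],  # hypothetical async sender
    max_workers: int = 64,
) -> List[int]:
    semaphore = asyncio.Semaphore(max_workers)

    async def submit(doc: Dict) -> int:
        async with semaphore:  # at most `max_workers` requests in flight
            return await send(doc)

    # gather preserves input order in the returned list
    return list(await asyncio.gather(*(submit(doc) for doc in docs)))
```

With HTTP/2, many such in-flight requests can be multiplexed over a single connection, which is why `max_connections` defaults to 1 while `max_workers` defaults to 64.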

query_many_async(queries, num_connections=1, max_concurrent=100, adaptive=True, client_kwargs={}, **query_kwargs) async

Execute many queries asynchronously using httpx.AsyncClient. Number of concurrent requests is controlled by the max_concurrent parameter. Each query will be retried up to 3 times using an exponential backoff strategy.

When adaptive=True (default), an AdaptiveThrottler is used that starts with a conservative concurrency limit and automatically adjusts based on server responses to prevent overloading Vespa with expensive operations.

Parameters:

Name Type Description Default
queries Iterable[dict]

Iterable of query bodies (dictionaries) to be sent.

required
num_connections int

Number of connections to be used in the asynchronous client (uses HTTP/2). Defaults to 1.

1
max_concurrent int

Maximum number of concurrent requests. Defaults to 100. Raise this value with care, since too many concurrent requests can overload the server.

100
adaptive bool

Use adaptive throttling. Defaults to True. When True, starts with lower concurrency and adjusts based on error rates.

True
client_kwargs dict

Additional arguments to be passed to the httpx.AsyncClient.

{}
**query_kwargs dict

Additional arguments to be passed to the query method.

{}

Returns:

Type Description
List[VespaQueryResponse]

List[VespaQueryResponse]: List of VespaQueryResponse objects.
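The retry behaviour mentioned above (a bounded number of attempts with exponential backoff) follows a common pattern, sketched below. The helper name `with_retries`, the `base_delay` value, and the interpretation of "up to 3 times" as three attempts in total are assumptions; pyvespa's actual retry settings may differ.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    op: Callable[[], Awaitable[T]],
    max_attempts: int = 3,     # assumption: three attempts in total
    base_delay: float = 0.5,   # assumed value, for illustration only
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return await op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            # Delay doubles after every failure: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    # Not reached: the last iteration either returns or re-raises.
```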

query_many(queries, num_connections=1, max_concurrent=100, adaptive=True, client_kwargs={}, **query_kwargs)

Execute many queries asynchronously using httpx.AsyncClient. This method is a wrapper around the query_many_async method that uses the asyncio event loop to run the coroutine. Number of concurrent requests is controlled by the max_concurrent parameter. Each query will be retried up to 3 times using an exponential backoff strategy.

When adaptive=True (default), an AdaptiveThrottler is used that starts with a conservative concurrency limit and automatically adjusts based on server responses to prevent overloading Vespa with expensive operations.

Parameters:

Name Type Description Default
queries Iterable[dict]

Iterable of query bodies (dictionaries) to be sent.

required
num_connections int

Number of connections to be used in the asynchronous client (uses HTTP/2). Defaults to 1.

1
max_concurrent int

Maximum number of concurrent requests. Defaults to 100. Raise this value with care, since too many concurrent requests can overload the server.

100
adaptive bool

Use adaptive throttling. Defaults to True. When True, starts with lower concurrency and adjusts based on error rates.

True
client_kwargs dict

Additional arguments to be passed to the httpx.AsyncClient.

{}
**query_kwargs dict

Additional arguments to be passed to the query method.

{}

Returns:

Type Description
List[VespaQueryResponse]

List[VespaQueryResponse]: List of VespaQueryResponse objects.
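The idea behind adaptive throttling (start conservatively, raise the limit while responses are healthy, cut it when the server signals overload) can be illustrated with a toy additive-increase, multiplicative-decrease policy. `ToyThrottler` is entirely hypothetical: the status codes treated as overload signals and the step sizes are assumptions, and the class does not reproduce the internals of `AdaptiveThrottler`.

```python
class ToyThrottler:
    """Additive-increase / multiplicative-decrease concurrency limit (illustrative)."""

    def __init__(self, start: int = 10, ceiling: int = 100) -> None:
        self.ceiling = ceiling
        self.limit = min(start, ceiling)

    def record(self, status_code: int) -> int:
        """Update the limit from one response and return the new value."""
        if status_code in (429, 503, 504):
            # Assumed overload signals: back off quickly.
            self.limit = max(1, self.limit // 2)
        elif 200 <= status_code < 300:
            # Healthy response: probe for more capacity one step at a time.
            self.limit = min(self.ceiling, self.limit + 1)
        return self.limit
```

Here `ceiling` plays the role of `max_concurrent`: the limit never grows beyond it, however healthy the responses are.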

delete_data(schema, data_id, namespace=None, groupname=None, **kwargs)

Delete a data point from a Vespa app.

Example usage
app = Vespa(url="localhost", port=8080)
response = app.delete_data(schema="schema_name", data_id="1")
print(response)

Parameters:

Name Type Description Default
schema str

The schema that we are deleting data from.

required
data_id str

Unique id associated with this data point.

required
namespace str

The namespace that we are deleting data from. If no namespace is provided, the schema is used.

None
groupname str

The groupname that we are deleting data from.

None
**kwargs dict

Additional arguments to be passed to the HTTP DELETE request. See Vespa API documentation for more details.

{}

Returns:

Name Type Description
Response VespaResponse

The response of the HTTP DELETE request.

delete_all_docs(content_cluster_name, schema, namespace=None, slices=1, **kwargs)

Delete all documents associated with the schema. This might block for a long time as it requires sending multiple delete requests to complete.

Parameters:

Name Type Description Default
content_cluster_name str

Name of content cluster to GET from, or visit.

required
schema str

The schema that we are deleting data from.

required
namespace str

The namespace that we are deleting data from. If no namespace is provided, the schema is used.

None
slices int

Number of slices to use for parallel delete requests. Defaults to 1.

1
**kwargs dict

Additional arguments to be passed to the HTTP DELETE request. See Vespa API documentation for more details.

{}

Returns:

Name Type Description
Response Response

The response of the HTTP DELETE request.

visit(content_cluster_name, schema=None, namespace=None, slices=1, selection='true', wanted_document_count=500, slice_id=None, **kwargs)

Visit all documents associated with the schema and matching the selection.

Runs each slice on a separate thread. For each slice, yields the response for each page.

Example usage
for slice_ in app.visit(content_cluster_name="content_cluster_name", schema="schema_name", slices=2):
    for response in slice_:
        print(response.json)

Parameters:

Name Type Description Default
content_cluster_name str

Name of content cluster to GET from.

required
schema str

The schema that we are visiting data from.

None
namespace str

The namespace that we are visiting data from.

None
slices int

Number of slices to use for parallel GET.

1
selection str

Selection expression to filter documents.

'true'
wanted_document_count int

Best effort number of documents to retrieve for each request. May contain less if there are not enough documents left.

500
slice_id int

Slice id to use for the visit. If None, all slices will be used.

None
**kwargs dict

Additional HTTP request parameters. See Vespa API documentation.

{}

Yields:

Type Description
Generator[VespaVisitResponse, None, None]

Generator[Generator[Response]]: A generator of slices, each containing a generator of responses.

Raises:

Type Description
HTTPError

If an HTTP error occurred.

get_data(schema, data_id, namespace=None, groupname=None, raise_on_not_found=False, **kwargs)

Get a data point from a Vespa app.

Parameters:

Name Type Description Default
data_id str

Unique id associated with this data point.

required
schema str

The schema that we are getting data from. Will attempt to infer schema name if not provided.

required
namespace str

The namespace that we are getting data from. If no namespace is provided, the schema is used.

None
groupname str

The groupname that we are getting data from.

None
raise_on_not_found bool

Raise an exception if the data_id is not found. Default is False.

False
**kwargs dict

Additional arguments to be passed to the HTTP GET request. See Vespa API documentation.

{}

Returns:

Name Type Description
Response VespaResponse

The response of the HTTP GET request.

update_data(schema, data_id, fields, create=False, namespace=None, groupname=None, compress='auto', **kwargs)

Update a data point in a Vespa app.

Example usage
vespa = Vespa(url="localhost", port=8080)

fields = {"mystringfield": "value1", "myintfield": 42}
response = vespa.update_data(schema="schema_name", data_id="id1", fields=fields)
# or, with partial update, setting auto_assign=False
fields = {"myintfield": {"increment": 1}}
response = vespa.update_data(schema="schema_name", data_id="id1", fields=fields, auto_assign=False)
print(response.json)

Parameters:

Name Type Description Default
schema str

The schema that we are updating data in.

required
data_id str

Unique id associated with this data point.

required
fields dict

Dict containing all the fields you want to update.

required
create bool

If true, updates to non-existent documents will create an empty document to update.

False
auto_assign bool

Assumes fields-parameter is an assignment operation. If set to false, the fields parameter should be a dictionary including the update operation.

required
namespace str

The namespace that we are updating data in. If no namespace is provided, the schema is used.

None
groupname str

The groupname that we are updating data in.

None
compress Union[str, bool]

Whether to compress the request body. Defaults to "auto", which will compress if the body is larger than 1024 bytes.

'auto'
**kwargs dict

Additional arguments to be passed to the HTTP PUT request. See Vespa API documentation.

{}

Returns:

Name Type Description
Response VespaResponse

The response of the HTTP PUT request.
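The effect of `auto_assign` can be sketched as a small transformation of the `fields` dictionary into an update request body. The helper `build_update_body` and its default of `auto_assign=True` are assumptions for illustration; the `{"assign": value}` wrapper is the assignment operation of Vespa's document update JSON format.

```python
from typing import Any, Dict


def build_update_body(fields: Dict[str, Any], auto_assign: bool = True) -> Dict[str, Any]:
    """Build the JSON body of a partial update request."""
    if auto_assign:
        # Treat every value as a plain assignment.
        fields = {name: {"assign": value} for name, value in fields.items()}
    # With auto_assign=False the caller supplies the operations directly,
    # e.g. {"myintfield": {"increment": 1}}.
    return {"fields": fields}
```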

get_model_from_application_package(model_name)

Get model definition from application package, if available.

predict(x, model_id, function_name='output_0')

Obtain a stateless model evaluation.

Parameters:

Name Type Description Default
x various

Input where the format depends on the task that the model is serving.

required
model_id str

The id of the model used to serve the prediction.

required
function_name str

The name of the output function to be evaluated.

'output_0'

Returns:

Name Type Description
var

Model prediction.

get_document_v1_path(id, schema=None, namespace=None, group=None, number=None)

Convert to document v1 path.

Parameters:

Name Type Description Default
id str

The id of the document.

required
namespace str

The namespace of the document.

None
schema str

The schema of the document.

None
group str

The group of the document.

None
number int

The number of the document.

None

Returns:

Name Type Description
str str

The path to the document v1 endpoint.
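For orientation, the three path shapes of Vespa's Document v1 API (`docid`, `group`, and `number`) can be sketched as follows. The standalone function `document_v1_path` is hypothetical, the fallback of namespace to schema is assumed from the other method descriptions, and the sketch does no URL escaping of the id.

```python
from typing import Optional


def document_v1_path(
    doc_id: str,
    schema: str,
    namespace: Optional[str] = None,
    group: Optional[str] = None,
    number: Optional[int] = None,
) -> str:
    namespace = namespace or schema  # assumed fallback when no namespace is given
    base = f"/document/v1/{namespace}/{schema}"
    if number is not None:
        return f"{base}/number/{number}/{doc_id}"
    if group is not None:
        return f"{base}/group/{group}/{doc_id}"
    return f"{base}/docid/{doc_id}"
```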

VespaQueryResponse(json, status_code, url, request_body=None)

Bases: VespaResponse

get_json()

For debugging when the response does not have hits.

Returns:

Type Description
Dict

JSON object with full response

RandomHitsSamplingStrategy

Bases: Enum

Enum for different random hits sampling strategies.

  • RATIO: Sample random hits as a ratio of relevant docs (e.g., 1.0 = equal number, 2.0 = twice as many)
  • FIXED: Sample a fixed number of random hits per query

VespaEvaluatorBase(queries, relevant_docs, vespa_query_fn, app, name='', id_field='', write_csv=False, csv_dir=None)

Bases: ABC

Abstract base class for Vespa evaluators providing initialization and interface.

run() abstractmethod

Abstract method to be implemented by subclasses.

__call__()

Make the evaluator callable.

VespaEvaluator(queries, relevant_docs, vespa_query_fn, app, name='', id_field='', accuracy_at_k=[1, 3, 5, 10], precision_recall_at_k=[1, 3, 5, 10], mrr_at_k=[10], ndcg_at_k=[10], map_at_k=[100], write_csv=False, csv_dir=None)

Bases: VespaEvaluatorBase

Evaluate retrieval performance on a Vespa application.

This class:

  • Iterates over queries and issues them against your Vespa application.
  • Retrieves top-k documents per query (with k = max of your IR metrics).
  • Compares the retrieved documents with a set of relevant document ids.
  • Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k.
  • Logs Vespa search times for each query.
  • Logs/returns these metrics.
  • Optionally writes out to CSV.

Note: The 'id_field' needs to be marked as an attribute in your Vespa schema, so filtering can be done on it.

Example usage
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
# Or, relevant_docs can be a dict of query_id => map of doc_id => relevance
# relevant_docs = {
#     "q1": {"d12": 1, "d99": 0.1},
#     "q2": {"d101": 0.01},
#     # ...
# }
# Note that for non-binary relevance, the relevance values should be in [0, 1], and that
# only the nDCG metric will be computed.

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": 'select * from sources * where userInput("' + query_text + '");',
        "hits": top_k,
        "ranking": "your_ranking_profile",
    }

app = Vespa(url="http://localhost", port=8080)

evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    name="test-run",
    accuracy_at_k=[1, 3, 5],
    precision_recall_at_k=[1, 3, 5],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=True
)

results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)

Parameters:

Name Type Description Default
queries Dict[str, str]

A dictionary where keys are query IDs and values are query strings.

required
relevant_docs Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]

A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, a dict of doc_id to relevance score (float between 0 and 1) for graded relevance, or a single doc_id string.

required
vespa_query_fn Callable[[str, int, Optional[str]], dict]

A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary.

required
app Vespa

An instance of the Vespa application.

required
name str

A name for this evaluation run. Defaults to "".

''
id_field str

The field name in the Vespa hit that contains the document ID. If empty, it tries to infer the ID from the 'id' field or 'fields.id'. Defaults to "".

''
accuracy_at_k List[int]

List of k values for which to compute Accuracy@k. Defaults to [1, 3, 5, 10].

[1, 3, 5, 10]
precision_recall_at_k List[int]

List of k values for which to compute Precision@k and Recall@k. Defaults to [1, 3, 5, 10].

[1, 3, 5, 10]
mrr_at_k List[int]

List of k values for which to compute MRR@k. Defaults to [10].

[10]
ndcg_at_k List[int]

List of k values for which to compute NDCG@k. Defaults to [10].

[10]
map_at_k List[int]

List of k values for which to compute MAP@k. Defaults to [100].

[100]
write_csv bool

Whether to write the evaluation results to a CSV file. Defaults to False.

False
csv_dir Optional[str]

Directory to save the CSV file. Defaults to None (current directory).

None
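The `vespa_query_fn` type allows an optional third argument, the query id. A sketch of a query function that uses it is shown below. The `PER_QUERY_FILTERS` mapping, the `category` field, and the rank profile name are hypothetical. The query text is passed through the `query` parameter together with `userQuery()`, which avoids quoting problems from building the YQL string by concatenation.

```python
from typing import Dict, Optional

# Hypothetical per-query YQL filters, keyed by query id.
PER_QUERY_FILTERS: Dict[str, str] = {"q1": 'category contains "hardware"'}


def my_vespa_query_fn(query_text: str, top_k: int, query_id: Optional[str] = None) -> dict:
    where = "userQuery()"
    extra = PER_QUERY_FILTERS.get(query_id) if query_id is not None else None
    if extra:
        where = f"{where} and {extra}"
    return {
        "yql": f"select * from sources * where {where}",
        "query": query_text,                 # consumed by userQuery()
        "hits": top_k,
        "ranking": "your_ranking_profile",   # placeholder profile name
    }
```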

run()

Executes the evaluation by running queries and computing IR metrics.

This method:

  1. Executes all configured queries against the Vespa application.
  2. Collects search results and timing information.
  3. Computes the configured IR metrics (Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k).
  4. Records search timing statistics.
  5. Logs results and optionally writes them to CSV.

Returns:

Name Type Description
dict Dict[str, float]

A dictionary containing:

  • IR metrics with names like "accuracy@k", "precision@k", etc.
  • Search time statistics ("searchtime_avg", "searchtime_q50", etc.).

Metric values are floats between 0 and 1; timing values are in seconds.

Example
{
    "accuracy@1": 0.75,
    "ndcg@10": 0.68,
    "searchtime_avg": 0.0123,
    ...
}
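For readers unfamiliar with the metric names, here are textbook definitions for binary relevance, computed for a single query. These are common formulations written for illustration; the function names are hypothetical and pyvespa's exact conventions (for example tie handling or denominators) may differ.

```python
import math
from typing import List, Set


def accuracy_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top k."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for d in ranked[:k] if d in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Position i (0-based) is discounted by log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```

The per-query values would then be averaged over all queries to produce entries such as "accuracy@1" or "ndcg@10".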

VespaMatchEvaluator(queries, relevant_docs, vespa_query_fn, app, id_field, name='', rank_profile='unranked', write_csv=False, write_verbose=False, csv_dir=None)

Bases: VespaEvaluatorBase

Evaluate recall in the match-phase over a set of queries for a Vespa application.

This class:

  • Iterates over queries and issues them against your Vespa application.
  • Sends one query with limit 0 to get the number of matched documents.
  • Sends one query with recall-parameter set according to the provided relevant documents.
  • Compares the retrieved documents with a set of relevant document ids.
  • Logs Vespa search times for each query.
  • Logs/returns these metrics.
  • Optionally writes out to CSV.

Note: If evaluation speed matters, use a rank profile without any first-phase (or second-phase) ranking. If you do so, make sure that the rank profile you use defines the same inputs. For example, to evaluate a YQL query that includes the nearestNeighbor operator, your rank profile needs to define the corresponding input tensor. You must also either provide the query tensor or define it as an input (e.g. 'input.query(embedding)=embed(@query)') in your Vespa query function.

Also note that the 'id_field' needs to be marked as an attribute in your Vespa schema, so filtering can be done on it.

Example usage

from vespa.application import Vespa
from vespa.evaluation import VespaMatchEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
# Graded relevance (a dict of doc_id => relevance) is not supported for match evaluation.

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": 'select * from sources * where userInput("' + query_text + '");',
        "hits": top_k,
        "ranking": "your_ranking_profile",
    }

app = Vespa(url="http://localhost", port=8080)

evaluator = VespaMatchEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    name="test-run",
    id_field="id",
    write_csv=True,
    write_verbose=True,
)

results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)

Parameters:

Name Type Description Default
queries Dict[str, str]

A dictionary where keys are query IDs and values are query strings.

required
relevant_docs Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]

A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, or a single doc_id string. Graded relevance (dict of doc_id to relevance score) is not supported for match evaluation.

required
vespa_query_fn Callable[[str, int, Optional[str]], dict]

A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary.

required
app Vespa

An instance of the Vespa application.

required
name str

A name for this evaluation run. Defaults to "".

''
id_field str

The field name in the Vespa hit that contains the document ID. Must be marked as an attribute in your Vespa schema, so filtering can be done on it.

required
write_csv bool

Whether to write the summary evaluation results to a CSV file. Defaults to False.

False
write_verbose bool

Whether to write detailed query-level results to a separate CSV file. Defaults to False.

False
csv_dir Optional[str]

Directory to save the CSV files. Defaults to None (current directory).

None

create_grouping_filter(yql, id_field, relevant_ids) staticmethod

Create a grouping filter to append to Vespa YQL queries, limiting results to relevant documents. The appended clause has the form:

| all( group(id_field) filter(regex("", id_field)) each(output(count())))

Parameters:

  • yql (str): The base YQL query string.
  • id_field (str): The field name in the Vespa hit that contains the document ID.
  • relevant_ids (list[str]): List of relevant document IDs to include in the filter.

Returns: str: The modified YQL query string with the grouping filter applied.

extract_matched_ids(resp, id_field) staticmethod

Extract matched document IDs from Vespa query response hits.

Parameters:

  • resp (VespaQueryResponse): The Vespa query response object.
  • id_field (str): The field name in the Vespa hit that contains the document ID.

Returns: Set[str]: A set of matched document IDs.

run()

Executes the match-phase recall evaluation.

This method:

  1. Sends a grouping query to see which of the relevant documents were matched, and to get totalCount.
  2. Computes recall metrics and match statistics.
  3. Logs results and optionally writes them to CSV.

Returns:

Name Type Description
dict Dict[str, float]

A dictionary containing recall metrics, match statistics, and search time statistics.

Example
{
    "match_recall": 0.85,
    "total_relevant_docs": 150,
    "total_matched_relevant": 128,
    "avg_matched_per_query": 45.2,
    "searchtime_avg": 0.015,
    ...
}
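The headline number in this example follows from the two counts next to it. A minimal sketch, assuming that match_recall is the share of relevant documents that were matched (this agrees with the example values; the helper name `match_recall_from_counts` is illustrative and not part of the library):

```python
def match_recall_from_counts(total_matched_relevant: int, total_relevant_docs: int) -> float:
    """Share of relevant documents that survived the match phase."""
    if total_relevant_docs == 0:
        # Assumption: with no relevant documents, recall is reported as 0.
        return 0.0
    return total_matched_relevant / total_relevant_docs


print(round(match_recall_from_counts(128, 150), 2))  # 0.85
```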

VespaCollectorBase(queries, relevant_docs, vespa_query_fn, app, id_field, name='', csv_dir=None, random_hits_strategy=RandomHitsSamplingStrategy.RATIO, random_hits_value=1.0, max_random_hits_per_query=None, collect_matchfeatures=True, collect_rankfeatures=False, collect_summaryfeatures=False, write_csv=True)

Bases: ABC

Abstract base class for Vespa training data collectors providing initialization and interface.

Initialize the collector.

Parameters:

Name Type Description Default
queries Dict[str, str]

Dictionary mapping query IDs to query strings

required
relevant_docs Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]

Dictionary mapping query IDs to relevant document IDs

required
vespa_query_fn Callable[[str, int, Optional[str]], dict]

Function to generate Vespa query bodies

required
app Vespa

Vespa application instance

required
id_field str

Field name containing document IDs in Vespa hits (must be defined as an attribute in the schema)

required
name str

Name for this collection run

''
csv_dir Optional[str]

Directory to save CSV files

None
random_hits_strategy Union[RandomHitsSamplingStrategy, str]

Strategy for sampling random hits, either "ratio" or "fixed". RATIO: sample random hits as a ratio of relevant docs. FIXED: sample a fixed number of random hits per query.

RATIO
random_hits_value Union[float, int]

Value for the sampling strategy. For RATIO: ratio value (e.g., 1.0 = equal, 2.0 = twice as many random hits). For FIXED: fixed number of random hits per query.

1.0
max_random_hits_per_query Optional[int]

Optional maximum limit on random hits per query (only applies when using RATIO strategy to prevent excessive sampling)

None
collect_matchfeatures bool

Whether to collect match features

True
collect_rankfeatures bool

Whether to collect rank features

False
collect_summaryfeatures bool

Whether to collect summary features

False
write_csv bool

Whether to write results to CSV file

True

collect() abstractmethod

Abstract method to be implemented by subclasses.

__call__()

Make the collector callable.

VespaFeatureCollector(queries, relevant_docs, vespa_query_fn, app, id_field, name='', csv_dir=None, random_hits_strategy=RandomHitsSamplingStrategy.RATIO, random_hits_value=1.0, max_random_hits_per_query=None, collect_matchfeatures=True, collect_rankfeatures=False, collect_summaryfeatures=False, write_csv=True)

Bases: VespaCollectorBase

Collects training data for retrieval tasks from a Vespa application.

This class:

  • Iterates over queries and issues them against your Vespa application.
  • Retrieves top-k documents per query.
  • Samples random hits based on the specified strategy.
  • Compiles a CSV file with query-document pairs and their relevance labels.

Important: To sample random hits, the rank profile used in your vespa_query_fn must have a ranking expression that produces a random ordering (see the docs for an example). In that case the relevance_score value in the returned results (or CSV) carries no meaning; it is only meaningful when you collect features for relevant documents only.

Example usage
from vespa.application import Vespa
from vespa.evaluation import VespaFeatureCollector

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": 'select * from sources * where userInput("' + query_text + '");',
        "hits": 10,  # Do not make use of top_k here.
        "ranking": "your_ranking_profile", # This should have `random` as ranking expression
    }

app = Vespa(url="http://localhost", port=8080)

collector = VespaFeatureCollector(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    id_field="id",  # Field in Vespa hit that contains the document ID (must be an attribute)
    name="retrieval-data-collection",
    csv_dir="/path/to/save/csv",
    random_hits_strategy="ratio",  # or RandomHitsSamplingStrategy.RATIO
    random_hits_value=1.0,  # Sample equal number of random hits to relevant docs
    max_random_hits_per_query=100,  # Optional: cap random hits per query
    collect_matchfeatures=True,  # Collect match features from rank profile
    collect_rankfeatures=False,  # Skip traditional rank features
    collect_summaryfeatures=False,  # Skip summary features
)

collector()

Alternative Usage Examples:

# Example 1: Fixed number of random hits per query
collector = VespaFeatureCollector(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    id_field="id",  # Required field name
    random_hits_strategy="fixed",
    random_hits_value=50,  # Always sample 50 random hits per query
)

# Example 2: Ratio-based with a cap
collector = VespaFeatureCollector(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    id_field="id",  # Required field name
    random_hits_strategy="ratio",
    random_hits_value=2.0,  # Sample twice as many random hits as relevant docs
    max_random_hits_per_query=200,  # But never more than 200 per query
)

Parameters:

Name Type Description Default
queries Dict[str, str]

A dictionary where keys are query IDs and values are query strings.

required
relevant_docs Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]

A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, a dict of doc_id to relevance score (float between 0 and 1) for graded relevance, or a single doc_id string.

required
vespa_query_fn Callable[[str, int, Optional[str]], dict]

A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary.

required
app Vespa

An instance of the Vespa application.

required
id_field str

The field name in the Vespa hit that contains the document ID. This field must be defined as an attribute in your Vespa schema.

required
name str

A name for this data collection run. Defaults to "".

''
csv_dir Optional[str]

Directory to save the CSV file. Defaults to None (current directory).

None
random_hits_strategy Union[RandomHitsSamplingStrategy, str]

Strategy for sampling random hits. Can be "ratio" (or RandomHitsSamplingStrategy.RATIO) to sample as a ratio of relevant docs, or "fixed" (or RandomHitsSamplingStrategy.FIXED) to sample a fixed number per query. Defaults to "ratio".

RATIO
random_hits_value Union[float, int]

Value for the sampling strategy. For RATIO strategy: ratio value (e.g., 1.0 = equal number, 2.0 = twice as many random hits). For FIXED strategy: fixed number of random hits per query. Defaults to 1.0.

1.0
max_random_hits_per_query Optional[int]

Maximum limit on random hits per query. Only applies to RATIO strategy to prevent excessive sampling. Defaults to None (no limit).

None
collect_matchfeatures bool

Whether to collect match features defined in rank profile's match-features section. Defaults to True.

True
collect_rankfeatures bool

Whether to collect rank features using ranking.listFeatures=true. Defaults to False.

False
collect_summaryfeatures bool

Whether to collect summary features from document summaries. Defaults to False.

False
write_csv bool

Whether to write results to CSV file. Defaults to True.

True

get_recall_param(relevant_doc_ids, get_relevant)

Adds the recall parameter to the Vespa query body based on relevant document IDs.

Parameters:

Name Type Description Default
relevant_doc_ids set

A set of relevant document IDs.

required
get_relevant bool

Whether to retrieve relevant documents.

required

Returns:

Name Type Description
dict dict

The updated Vespa query body with the recall parameter.

calculate_random_hits_count(num_relevant_docs)

Calculate the number of random hits to sample based on the configured strategy.

Parameters:

Name Type Description Default
num_relevant_docs int

Number of relevant documents for the query

required

Returns:

Type Description
int

Number of random hits to sample
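The two strategies can be restated as a small function. This is a sketch based only on the parameter descriptions above; the function name `random_hits_count` and the choice to round fractional ratios up are assumptions, not confirmed library behavior.

```python
import math
from typing import Optional


def random_hits_count(
    num_relevant_docs: int,
    strategy: str = "ratio",
    value: float = 1.0,
    max_random_hits_per_query: Optional[int] = None,
) -> int:
    """Illustrative restatement of the RATIO and FIXED sampling strategies."""
    if strategy == "fixed":
        # FIXED: the same number of random hits for every query.
        return int(value)
    if strategy == "ratio":
        # RATIO: scale with the number of relevant documents.
        # Assumption: fractional results are rounded up.
        count = math.ceil(num_relevant_docs * value)
        # The cap only applies to the RATIO strategy.
        if max_random_hits_per_query is not None:
            count = min(count, max_random_hits_per_query)
        return count
    raise ValueError(f"unknown strategy: {strategy!r}")


print(random_hits_count(150, "ratio", 2.0, max_random_hits_per_query=200))  # 200
```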

collect()

Collects training data by executing queries and saving results to CSV.

This method:

  1. Executes all configured queries against the Vespa application.
  2. Collects the top-k document IDs and their relevance labels.
  3. Optionally writes the data to a CSV file for training purposes.
  4. Returns the collected data as a single dictionary with results.

Returns:

Type Description
Dict[str, List[Dict]]

Dict containing:

  • 'results': List of dictionaries, each containing all data for a query-document pair (query_id, doc_id, relevance_label, relevance_score, and all extracted features)
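To illustrate the shape of the returned dictionary, the sketch below groups the rows by query. The `collected` value is a hand-written stand-in for what collect() returns, and the feature name "bm25(title)" is made up for illustration.

```python
from collections import defaultdict

# Hand-written stand-in for the return value of collect().
collected = {
    "results": [
        {"query_id": "q1", "doc_id": "d12", "relevance_label": 1.0,
         "relevance_score": 0.7, "bm25(title)": 3.2},
        {"query_id": "q1", "doc_id": "d40", "relevance_label": 0.0,
         "relevance_score": 0.1, "bm25(title)": 0.4},
        {"query_id": "q2", "doc_id": "d101", "relevance_label": 1.0,
         "relevance_score": 0.9, "bm25(title)": 5.1},
    ]
}

# Group the flat list of query-document rows by query ID.
rows_by_query = defaultdict(list)
for row in collected["results"]:
    rows_by_query[row["query_id"]].append(row)

for query_id, rows in rows_by_query.items():
    positives = sum(1 for r in rows if r["relevance_label"] > 0)
    print(query_id, len(rows), "rows,", positives, "relevant")
```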

VespaNNParameters

Collection of nearest-neighbor query parameters used in nearest-neighbor classes.

VespaNNUnsuccessfulQueryError

Bases: Exception

Exception raised when trying to determine the hit ratio or compute the recall of an unsuccessful query.

VespaNNGlobalFilterHitratioEvaluator(queries, app, verify_target_hits=None)

Determine the hit ratio of the global filter in ANN queries. This hit ratio determines the search strategy used to perform the nearest-neighbor search and is essential to understanding and optimizing the behavior of Vespa on these queries.

This class:

  • Takes a list of queries.
  • Runs the queries with tracing.
  • Determines the hit ratio by examining the trace.

Parameters:

Name Type Description Default
queries Sequence[Mapping[str, str]]

List of ANN queries.

required
app Vespa

An instance of the Vespa application.

required

run()

Determines the hit ratios of the global filters in the supplied ANN queries.

Returns:

Type Description

List[List[float]]

List of lists of hit ratios, which are values from the interval [0.0, 1.0], corresponding to the supplied queries.

VespaNNRecallEvaluator(queries, hits, app, query_limit=20, **kwargs)

Determine recall of ANN queries. The recall of an ANN query with k hits is the fraction of the returned hits that actually are among the k nearest neighbors of the query vector.

This class:

  • Takes a list of queries.
  • First runs the queries as is (with the supplied HTTP parameters).
  • Then runs the queries with the supplied HTTP parameters and an additional parameter enforcing an exact nearest neighbor search.
  • Determines the recall by comparing the results.

Parameters:

Name Type Description Default
queries Sequence[Mapping[str, Any]]

List of ANN queries.

required
hits int

Number of hits to use. Should match the parameter targetHits in the used ANN queries.

required
app Vespa

An instance of the Vespa application.

required
query_limit int

Maximum number of queries to determine the recall for. Defaults to 20.

20
**kwargs dict {}

run()

Computes the recall of the supplied queries.

Returns:

Type Description
List[float]

List[float]: List of recall values from the interval [0.0, 1.0] corresponding to the supplied queries.
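The comparison step can be pictured as a set intersection between the approximate and the exact result lists. A minimal sketch, assuming both lists hold document IDs ordered by distance; `ann_recall` is a hypothetical helper, not a function of this module.

```python
from typing import Sequence


def ann_recall(approx_ids: Sequence[str], exact_ids: Sequence[str], hits: int) -> float:
    """Fraction of the exact top-`hits` neighbors that the approximate search returned."""
    exact_top = set(exact_ids[:hits])
    if not exact_top:
        # Assumption: an empty exact result yields a recall of 0.
        return 0.0
    found = exact_top.intersection(approx_ids[:hits])
    return len(found) / len(exact_top)


# Two of the three true nearest neighbors were returned.
print(ann_recall(["a", "b", "x"], ["a", "b", "c"], hits=3))
```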

VespaQueryBenchmarker(queries, app, time_limit=2000, max_concurrent=10, **kwargs)

Determine the searchtime of queries by running them multiple times and taking the average. Using the searchtime has the advantage of not including network latency.

This class:

  • Takes a list of queries.
  • Runs the queries for the given amount of time.
  • Determines the average searchtime of these runs.

Parameters:

Name Type Description Default
queries Sequence[Mapping[str, Any]]

List of queries.

required
app Vespa

An instance of the Vespa application.

required
time_limit int

Time to run the benchmark for (in milliseconds).

2000
**kwargs dict {}

run()

Runs the benchmark (including a warm-up run not included in the result).

Returns:

Type Description
List[float]

List[float]: List of searchtimes, corresponding to the supplied queries.
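The measuring loop can be sketched as follows: repeat the whole query list until the time budget is used up, then average per query. This is an illustration only. `run_query` is a hypothetical callable returning the searchtime reported for one execution, and the warm-up run and concurrency of the real class are left out.

```python
import time
from typing import Callable, List, Mapping, Sequence


def average_searchtimes(
    queries: Sequence[Mapping[str, str]],
    run_query: Callable[[Mapping[str, str]], float],
    time_limit_ms: int = 2000,
) -> List[float]:
    """Repeat all queries until the time budget is spent; average per query."""
    totals = [0.0] * len(queries)
    passes = 0
    deadline = time.monotonic() + time_limit_ms / 1000.0
    # Always complete at least one full pass, even with a tiny budget.
    while passes == 0 or time.monotonic() < deadline:
        for i, query in enumerate(queries):
            totals[i] += run_query(query)
        passes += 1
    return [total / passes for total in totals]


def fake_backend(query: Mapping[str, str]) -> float:
    """Stub standing in for a real query call; the returned value is invented."""
    return 0.001 * int(query["hits"])


print(average_searchtimes([{"hits": "10"}, {"hits": "20"}], fake_backend, time_limit_ms=20))
```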

BucketedMetricResults(metric_name, buckets, values, filtered_out_ratios)

Stores aggregated statistics for a metric across query buckets.

Computes mean and various percentiles for values grouped by bucket, where each bucket contains multiple measurements (e.g., response times or recall values).

Parameters:

Name Type Description Default
metric_name str

Name of the metric being measured (e.g., "searchtime", "recall")

required
buckets List[int]

List of bucket indices that contain data

required
values List[List[float]]

List of lists containing measurements, one list per bucket

required
filtered_out_ratios List[float]

Pre-computed filtered-out ratios for each bucket

required

to_dict()

Convert results to dictionary format.

Returns:

Type Description
Dict[str, Any]

Dictionary containing bucket information and all statistics

VespaNNParameterOptimizer(app, queries, hits, buckets_per_percent=2, print_progress=False, benchmark_time_limit=5000, recall_query_limit=20, max_concurrent=10)

Get suggestions for configuring the nearest-neighbor parameters of a Vespa application.

This class:

  • Sorts ANN queries into buckets based on the hit-ratio of their global filter.
  • For every bucket, can determine the average response time of the queries in this bucket.
  • For every bucket, can determine the average recall of the queries in this bucket.
  • Can suggest a value for postFilterThreshold.
  • Can suggest a value for filterFirstThreshold.
  • Can suggest a value for filterFirstExploration.
  • Can suggest a value for approximateThreshold.

Parameters:

Name Type Description Default
app Vespa

An instance of the Vespa application.

required
queries Sequence[Mapping[str, Any]]

Queries to optimize for.

required
hits int

Number of hits to use in recall computations. Has to match the parameter targetHits in the used ANN queries.

required
buckets_per_percent int

Number of buckets created per percentage point, i.e. the resolution of the suggestions. Defaults to 2.

2
print_progress bool

Whether to print progress information while determining suggestions. Defaults to False.

False
benchmark_time_limit int

Time in milliseconds to spend per bucket benchmark. Defaults to 5000.

5000
recall_query_limit int

Number of queries per bucket to compute the recall for. Defaults to 20.

20
max_concurrent int

Number of queries to execute concurrently during benchmark/recall calculation. Defaults to 10.

10

get_bucket_interval_width()

Gets the width of the interval represented by a single bucket.

Returns:

Name Type Description
float float

Width of the interval represented by a single bucket.

get_number_of_buckets()

Gets the number of buckets.

Returns:

Name Type Description
int int

Number of buckets.

get_number_of_nonempty_buckets()

Counts the number of buckets that contain at least one query.

Returns:

Name Type Description
int int

The number of buckets that contain at least one query.

get_non_empty_buckets()

Gets the indices of the non-empty buckets.

Returns:

Type Description
List[int]

List[int]: List of indices of the non-empty buckets.

get_filtered_out_ratios()

Gets the (lower interval ends of the) filtered-out ratios of the non-empty buckets.

Returns:

Type Description
List[float]

List[float]: List of the (lower interval ends of the) filtered-out ratios of the non-empty buckets.

get_number_of_queries()

Gets the number of queries contained in the buckets.

Returns:

Name Type Description
int int

Number of queries contained in the buckets.

bucket_to_hitratio(bucket)

Gets the hit ratio (upper endpoint of interval) corresponding to the given bucket index.

Parameters:

Name Type Description Default
bucket int

Index of a bucket.

required

Returns:

Name Type Description
float float

Hit ratio corresponding to the given bucket index.

bucket_to_filtered_out(bucket)

Gets the filtered-out ratio (1 - hit ratio, lower endpoint of interval) corresponding to the given bucket index.

Parameters:

Name Type Description Default
bucket int

Index of a bucket.

required

Returns:

Name Type Description
float float

Filtered-out ratio corresponding to the given bucket index.

buckets_to_filtered_out(buckets)

Applies bucket_to_filtered_out to list of bucket indices.

Parameters:

Name Type Description Default
buckets List[int]

List of bucket indices.

required

Returns:

Type Description
List[float]

List[float]: Filtered-out ratios corresponding to the given bucket indices.

filtered_out_to_bucket(percent)

Gets the index of the bucket containing the given filtered-out ratio.

Parameters:

Name Type Description Default
percent float

Filtered-out ratio.

required

Returns:

Name Type Description
int int

Index of bucket containing the given filtered-out ratio.
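The bucket helpers above amount to simple arithmetic on the bucket index. The sketch below is an assumption-based reading: it takes the total number of buckets to be 100 × buckets_per_percent, which agrees with the bucket_interval_width of 0.005 and the bucket/ratio pairs in the example report of run(), but it is not taken from the library source.

```python
def bucket_interval_width(buckets_per_percent: int = 2) -> float:
    """Width of the filtered-out interval covered by one bucket."""
    return 1.0 / (100 * buckets_per_percent)


def number_of_buckets(buckets_per_percent: int = 2) -> int:
    # Assumption: the buckets together cover the interval [0.0, 1.0).
    return 100 * buckets_per_percent


def bucket_to_filtered_out(bucket: int, buckets_per_percent: int = 2) -> float:
    """Lower endpoint of the bucket's filtered-out interval."""
    return bucket * bucket_interval_width(buckets_per_percent)


def bucket_to_hitratio(bucket: int, buckets_per_percent: int = 2) -> float:
    """Upper endpoint of the bucket's hit-ratio interval (1 - filtered-out)."""
    return 1.0 - bucket_to_filtered_out(bucket, buckets_per_percent)


def filtered_out_to_bucket(ratio: float, buckets_per_percent: int = 2) -> int:
    """Index of the bucket containing the given filtered-out ratio."""
    n = number_of_buckets(buckets_per_percent)
    # Clamp so that a ratio of exactly 1.0 lands in the last bucket.
    # Values exactly on a bucket boundary are subject to floating-point rounding.
    return min(int(ratio * n), n - 1)


print(bucket_to_filtered_out(20), filtered_out_to_bucket(0.0125))
```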

distribute_to_buckets(queries_with_hitratios)

Distributes the given queries to buckets according to their given hit ratios.

Parameters:

Name Type Description Default
queries_with_hitratios List[Tuple[Dict[str, str], float]]

Queries with hit ratios.

required

Returns:

Type Description
List[List[str]]

List[List[str]]: List of buckets.
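Distribution then reduces to computing a bucket index per query. A sketch under the same assumptions as the bucket descriptions (buckets indexed by filtered-out ratio, 100 × buckets_per_percent buckets in total); the name `sketch_distribute_to_buckets` is illustrative, and the buckets here hold the query dictionaries themselves.

```python
from typing import Dict, List, Sequence, Tuple


def sketch_distribute_to_buckets(
    queries_with_hitratios: Sequence[Tuple[Dict[str, str], float]],
    buckets_per_percent: int = 2,
) -> List[List[Dict[str, str]]]:
    """Place each query in the bucket that covers its filtered-out ratio."""
    n = 100 * buckets_per_percent
    buckets: List[List[Dict[str, str]]] = [[] for _ in range(n)]
    for query, hit_ratio in queries_with_hitratios:
        filtered_out = 1.0 - hit_ratio
        # Clamp to the valid index range (hit ratio 0.0 maps to the last bucket).
        index = min(max(int(filtered_out * n), 0), n - 1)
        buckets[index].append(query)
    return buckets


buckets = sketch_distribute_to_buckets(
    [({"yql": "query A"}, 0.75), ({"yql": "query B"}, 0.5), ({"yql": "query C"}, 0.0)]
)
print([i for i, bucket in enumerate(buckets) if bucket])  # [50, 100, 199]
```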

determine_hit_ratios_and_distribute_to_buckets(queries)

Distributes the given queries to buckets by determining their hit ratios.

Parameters:

Name Type Description Default
queries Sequence[Mapping[str, Any]]

Queries.

required

Returns:

Type Description
List[List[str]]

List[List[str]]: List of buckets.

query_from_get_string(get_query) staticmethod

Parses a query in GET format.

Parameters:

Name Type Description Default
get_query str

Query as a single-line GET string.

required

Returns:

Type Description
Dict[str, str]

Dict[str,str]: Query as a dict.
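Parsing such a line can be done with the standard library alone. A minimal sketch, assuming the GET string looks like a path followed by URL-encoded parameters; `parse_get_query` is a hypothetical stand-in and may differ from the library's own parsing.

```python
from typing import Dict
from urllib.parse import parse_qsl, urlsplit


def parse_get_query(get_query: str) -> Dict[str, str]:
    """Turn '/search/?yql=...&hits=10' into a dict of decoded parameters."""
    query_string = urlsplit(get_query.strip()).query
    # parse_qsl URL-decodes names and values; later duplicates win in dict().
    return dict(parse_qsl(query_string, keep_blank_values=True))


print(parse_get_query("/search/?yql=select%20*%20from%20doc%20where%20true&hits=10"))
```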

distribute_file_to_buckets(filename)

Distributes the queries from the given file to buckets according to their given hit ratios.

Parameters:

Name Type Description Default
filename str

Name of file with GET queries (one per line).

required

Returns:

Type Description
List[List[str]]

List[List[str]]: List of buckets.

has_sufficient_queries()

Checks whether the given queries are deemed sufficient to give meaningful suggestions.

Returns:

Name Type Description
bool bool

Whether the given queries are deemed sufficient to give meaningful suggestions.

buckets_sufficiently_filled()

Checks whether all non-empty buckets have at least 10 queries.

Returns:

Name Type Description
bool bool

Whether all non-empty buckets have at least 10 queries.
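Expressed over a plain list of buckets, the check is a one-liner. The sketch below is illustrative; the function name and the `minimum` parameter are assumptions, with 10 taken from the description above.

```python
from typing import List, Sequence


def all_nonempty_buckets_filled(buckets: Sequence[List[object]], minimum: int = 10) -> bool:
    """True if every bucket that holds any query holds at least `minimum` queries."""
    return all(len(bucket) >= minimum for bucket in buckets if bucket)


print(all_nonempty_buckets_filled([[], ["q"] * 10, ["q"] * 12]))  # True
print(all_nonempty_buckets_filled([[], ["q"] * 3]))  # False
```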

get_query_distribution()

Gets the distribution of queries across all buckets.

Returns:

Type Description

List[float]: List of filtered-out ratios corresponding to non-empty buckets.

List[int]: List of numbers of queries.

benchmark(**kwargs)

For each non-empty bucket, determine the average searchtime.

Parameters:

Name Type Description Default
**kwargs dict {}

Returns:

Name Type Description
BucketedMetricResults BucketedMetricResults

The benchmark results.

compute_average_recalls(**kwargs)

For each non-empty bucket, determine the average recall.

Parameters:

Name Type Description Default
**kwargs dict {}

Returns:

Name Type Description
BucketedMetricResults BucketedMetricResults

The recall results.

suggest_filter_first_threshold(**kwargs)

Suggests a value for filterFirstThreshold based on performed benchmarks.

Parameters:

Name Type Description Default
**kwargs dict

Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. Should contain ranking.matching.filterFirstExploration!

{}

Returns:

Name Type Description
dict dict[str, float | dict[str, List[float]]]

A dictionary containing the suggested value for filterFirstThreshold, together with the benchmarks and recall measurements it is based on.
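One way to read the two benchmark curves is to look for the filtered-out ratio at which the filter-first strategy starts to beat plain HNSW. The sketch below does this with linear interpolation, using (rounded) values from the example report under run(). It is an illustrative heuristic, not necessarily how the library derives its suggestion, and `crossover_filtered_out` is a hypothetical helper.

```python
from typing import Optional, Sequence


def crossover_filtered_out(
    filtered_out_ratios: Sequence[float],
    baseline: Sequence[float],
    candidate: Sequence[float],
) -> Optional[float]:
    """Filtered-out ratio where `candidate` becomes faster than `baseline`."""
    diffs = [b - c for b, c in zip(baseline, candidate)]
    for i in range(1, len(diffs)):
        if diffs[i - 1] <= 0 < diffs[i]:
            x0, x1 = filtered_out_ratios[i - 1], filtered_out_ratios[i]
            # Linear interpolation between the two neighboring measurements.
            t = -diffs[i - 1] / (diffs[i] - diffs[i - 1])
            return x0 + t * (x1 - x0)
    return None  # the curves never cross in the measured range


ratios = [0.01, 0.1, 0.5, 0.9, 0.95, 0.99]
hnsw = [2.779, 2.725, 3.152, 7.139, 11.362, 32.6]
filter_first = [3.544, 3.454, 3.444, 3.413, 3.409, 4.603]

x = crossover_filtered_out(ratios, hnsw, filter_first)
# Expressed as a hit ratio (1 - filtered-out ratio).
print(round(1 - x, 2))  # 0.47
```

For these numbers the crossover lands close to the suggestion of 0.47 shown in the example report, which hints that the suggestion is expressed as a hit ratio.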

suggest_approximate_threshold(**kwargs)

Suggests a value for approximateThreshold based on performed benchmarks.

Parameters:

Name Type Description Default
**kwargs dict

Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. Should contain ranking.matching.filterFirstExploration and ranking.matching.filterFirstThreshold!

{}

Returns:

Name Type Description
dict dict[str, float | dict[str, List[float]]]

A dictionary containing the suggested value for approximateThreshold, together with the benchmarks and recall measurements it is based on.

suggest_post_filter_threshold(**kwargs)

Suggests a value for postFilterThreshold based on performed benchmarks and recall measurements.

Parameters:

Name Type Description Default
**kwargs dict

Additional HTTP request parameters. See: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#request-parameters. Should contain ranking.matching.filterFirstExploration, ranking.matching.filterFirstThreshold, and ranking.matching.approximateThreshold!

{}

Returns:

Name Type Description
dict dict[str, float | dict[str, List[float]]]

A dictionary containing the suggested value for postFilterThreshold, together with the benchmarks and recall measurements it is based on.

suggest_filter_first_exploration()

Suggests a value for filterFirstExploration based on benchmarks and recall measurements performed on the supplied Vespa app.

Returns:

Name Type Description
dict dict[str, float | dict[str, List[float]]]

A dictionary containing the suggested value, benchmarks, and recall measurements.

run()

Determines suggestions for all parameters supported by this class.

This method:

  1. Determines the hit-ratios of supplied ANN queries.
  2. Sorts these queries into buckets based on the determined hit-ratio.
  3. Determines a suggestion for filterFirstExploration.
  4. Determines a suggestion for filterFirstThreshold.
  5. Determines a suggestion for approximateThreshold.
  6. Determines a suggestion for postFilterThreshold.
  7. Reports the determined suggestions and all benchmarks and recall measurements performed.

Returns:

Name Type Description
dict Dict[str, Any]

A dictionary containing the suggested values, information about the query distribution, performed benchmarks, and recall measurements.

Example
{
    "buckets": {
        "buckets_per_percent": 2,
        "bucket_interval_width": 0.005,
        "non_empty_buckets": [
            2,
            20,
            100,
            180,
            190,
            198
        ],
        "filtered_out_ratios": [
            0.01,
            0.1,
            0.5,
            0.9,
            0.95,
            0.99
        ],
        "hit_ratios": [
            0.99,
            0.9,
            0.5,
            0.09999999999999998,
            0.050000000000000044,
            0.010000000000000009
        ],
        "query_distribution": [
            100,
            100,
            100,
            100,
            100,
            100
        ]
    },
    "filterFirstExploration": {
        "suggestion": 0.39453125,
        "benchmarks": {
            "0.0": [
                4.265999999999999,
                4.256000000000001,
                3.9430000000000005,
                3.246999999999998,
                2.4610000000000003,
                1.768
            ],
            "1.0": [
                3.9259999999999984,
                3.6010000000000004,
                3.290999999999999,
                3.78,
                4.927000000000002,
                8.415000000000001
            ],
            "0.5": [
                3.6299999999999977,
                3.417,
                3.4490000000000007,
                3.752,
                4.257,
                5.99
            ],
            "0.25": [
                3.5830000000000006,
                3.616,
                3.3239999999999985,
                3.3200000000000016,
                2.654999999999999,
                2.3789999999999996
            ],
            "0.375": [
                3.465,
                3.4289999999999994,
                3.196999999999997,
                3.228999999999999,
                3.167,
                3.700999999999999
            ],
            "0.4375": [
                3.9880000000000013,
                3.463000000000002,
                3.4650000000000007,
                3.5000000000000013,
                3.7499999999999982,
                4.724000000000001
            ],
            "0.40625": [
                3.4990000000000006,
                3.3680000000000003,
                3.147000000000001,
                3.33,
                3.381,
                4.083999999999998
            ],
            "0.390625": [
                3.6060000000000008,
                3.5269999999999992,
                3.2820000000000005,
                3.433999999999998,
                3.2880000000000007,
                3.8609999999999984
            ],
            "0.3984375": [
                3.6870000000000016,
                3.386000000000001,
                3.336000000000001,
                3.316999999999999,
                3.5329999999999973,
                4.719000000000002
            ]
        },
        "recall_measurements": {
            "0.0": [
                0.8758,
                0.8768999999999997,
                0.8915,
                0.9489999999999994,
                0.9045999999999998,
                0.64
            ],
            "1.0": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9675999999999998,
                0.9852999999999996,
                0.9957999999999998
            ],
            "0.5": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9660999999999998,
                0.9759999999999996,
                0.9903
            ],
            "0.25": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9553999999999995,
                0.9323999999999996,
                0.8123000000000004
            ],
            "0.375": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9615999999999997,
                0.9599999999999999,
                0.9626000000000002
            ],
            "0.4375": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9642999999999999,
                0.9697999999999999,
                0.9832
            ],
            "0.40625": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9632,
                0.9642999999999999,
                0.9763999999999997
            ],
            "0.390625": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9625999999999999,
                0.9617999999999999,
                0.9688999999999998
            ],
            "0.3984375": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.963,
                0.9635000000000001,
                0.9738999999999999
            ]
        }
    },
    "filterFirstThreshold": {
        "suggestion": 0.47,
        "benchmarks": {
            "hnsw": [
                2.779,
                2.725000000000001,
                3.151999999999999,
                7.138999999999998,
                11.362,
                32.599999999999994
            ],
            "filter_first": [
                3.543999999999999,
                3.454,
                3.443999999999999,
                3.4129999999999994,
                3.4090000000000003,
                4.602999999999998
            ]
        },
        "recall_measurements": {
            "hnsw": [
                0.8284999999999996,
                0.8368999999999996,
                0.9007999999999996,
                0.9740999999999996,
                0.9852999999999993,
                0.9937999999999992
            ],
            "filter_first": [
                0.8757,
                0.8768999999999997,
                0.8909999999999999,
                0.9627999999999999,
                0.9630000000000001,
                0.9718999999999994
            ]
        }
    },
    "approximateThreshold": {
        "suggestion": 0.03,
        "benchmarks": {
            "exact": [
                33.072,
                31.99600000000001,
                23.256,
                9.155,
                6.069000000000001,
                2.0949999999999984
            ],
            "filter_first": [
                2.9570000000000003,
                2.91,
                3.165000000000001,
                3.396999999999998,
                3.3310000000000004,
                4.046
            ]
        },
        "recall_measurements": {
            "exact": [
                1.0,
                1.0,
                1.0,
                1.0,
                1.0,
                1.0
            ],
            "filter_first": [
                0.8284999999999996,
                0.8368999999999996,
                0.9007999999999996,
                0.9627999999999999,
                0.9630000000000001,
                0.9718999999999994
            ]
        }
    },
    "postFilterThreshold": {
        "suggestion": 0.49,
        "benchmarks": {
            "post_filtering": [
                2.0609999999999995,
                2.448,
                3.097999999999999,
                7.200999999999999,
                11.463000000000006,
                11.622999999999996
            ],
            "filter_first": [
                3.177999999999999,
                2.717000000000001,
                3.177,
                3.5000000000000004,
                3.455,
                2.1159999999999997
            ]
        },
        "recall_measurements": {
            "post_filtering": [
                0.8288999999999995,
                0.8355,
                0.8967999999999998,
                0.9519999999999997,
                0.9512999999999994,
                0.19180000000000003
            ],
            "filter_first": [
                0.8284999999999996,
                0.8368999999999996,
                0.9007999999999996,
                0.9627999999999999,
                0.9630000000000001,
                1.0
            ]
        }
    }
}
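A report of this shape can be reduced to a set of query parameters by picking the "suggestion" entry of each section. A minimal sketch; the `report` below is a trimmed copy of the example above, and mapping each name to a `ranking.matching.*` request parameter is an assumption that should be checked against the Vespa query API reference.

```python
# Trimmed-down report with the same top-level shape as the example above.
report = {
    "filterFirstExploration": {"suggestion": 0.39453125},
    "filterFirstThreshold": {"suggestion": 0.47},
    "approximateThreshold": {"suggestion": 0.03},
    "postFilterThreshold": {"suggestion": 0.49},
}

PARAMETERS = [
    "filterFirstExploration",
    "filterFirstThreshold",
    "approximateThreshold",
    "postFilterThreshold",
]

# Collect the suggestions as request parameters for later queries.
query_params = {
    f"ranking.matching.{name}": report[name]["suggestion"]
    for name in PARAMETERS
    if name in report
}

for key, value in query_params.items():
    print(key, "=", value)
```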

mean(values)

Compute the mean of a list of numbers without using numpy.

percentile(values, p)

Compute the p-th percentile of a list of values (0 <= p <= 100). This approximates numpy.percentile's behavior.
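For reference, a numpy-free mean and percentile can be written in a few lines. This sketch uses linear interpolation between the two closest ranks, which is the default method of numpy.percentile; whether the module's helpers match it in every edge case is an assumption.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean; raises ValueError on empty input."""
    if not values:
        raise ValueError("mean of empty sequence")
    return sum(values) / len(values)


def percentile(values: Sequence[float], p: float) -> float:
    """p-th percentile (0 <= p <= 100) with linear interpolation."""
    if not values:
        raise ValueError("percentile of empty sequence")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Fractional rank into the sorted list.
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


print(mean([1, 2, 3, 4]), percentile([1, 2, 3, 4], 50))  # 2.5 2.5
```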

validate_queries(queries)

Validate and normalize queries. Converts query IDs to strings if they are ints.

validate_qrels(qrels)

Validate and normalize qrels. Converts query IDs to strings if they are ints.

validate_vespa_query_fn(fn)

Validates the vespa_query_fn function.

The function must be callable and accept either 2 or 3 parameters:
  • (query_text: str, top_k: int)
  • or (query_text: str, top_k: int, query_id: Optional[str])

It must return a dictionary when called with test inputs.

Returns True if the function takes a query_id parameter, False otherwise.
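The checks described above can be approximated with inspect.signature. The sketch below is an illustration of the rules, not the library's implementation; `accepts_query_id`, the test inputs, and the `FILTERS` lookup are all hypothetical.

```python
import inspect
from typing import Callable, Optional


def accepts_query_id(fn: Callable[..., dict]) -> bool:
    """Validate a query function and report whether it takes a query_id."""
    if not callable(fn):
        raise ValueError("vespa_query_fn must be callable")
    params = list(inspect.signature(fn).parameters.values())
    if len(params) not in (2, 3):
        raise ValueError("vespa_query_fn must accept 2 or 3 parameters")
    takes_query_id = len(params) == 3
    # Call with test inputs to confirm that a dictionary comes back.
    args = ("test query", 10, "q0") if takes_query_id else ("test query", 10)
    if not isinstance(fn(*args), dict):
        raise ValueError("vespa_query_fn must return a dict")
    return takes_query_id


def two_arg_fn(query_text: str, top_k: int) -> dict:
    return {"yql": "select * from sources * where userQuery()", "query": query_text, "hits": top_k}


FILTERS = {"q1": "category contains 'gpu'"}  # hypothetical per-query filters


def three_arg_fn(query_text: str, top_k: int, query_id: Optional[str] = None) -> dict:
    extra = FILTERS.get(query_id)
    where = "userQuery()" if extra is None else f"userQuery() and {extra}"
    return {"yql": f"select * from sources * where {where}", "query": query_text, "hits": top_k}


print(accepts_query_id(two_arg_fn), accepts_query_id(three_arg_fn))  # False True
```

The three-parameter form is useful when the query body depends on the query ID, for example to attach a per-query filter as in `three_arg_fn`.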

filter_queries(queries, relevant_docs)

Filter out queries that have no relevant docs.

extract_doc_id_from_hit(hit, id_field)

Extract document ID from a Vespa hit.

get_id_field_from_hit(hit, id_field)

Get the ID field from a Vespa hit.

calculate_searchtime_stats(searchtimes)

Calculate search time statistics.

execute_queries(app, query_bodies, max_concurrent=10)

Execute queries and collect timing information. Returns the responses and a list of search times.

write_csv(metrics, searchtime_stats, csv_file, csv_dir, name)

Write metrics to CSV file.

log_metrics(name, metrics)

Log metrics with appropriate formatting.

extract_features_from_hit(hit, collect_matchfeatures, collect_rankfeatures, collect_summaryfeatures)

Extract features from a Vespa hit based on the collection configuration.

Parameters:

Name Type Description Default
hit dict

The Vespa hit dictionary

required
collect_matchfeatures bool

Whether to collect match features

required
collect_rankfeatures bool

Whether to collect rank features

required
collect_summaryfeatures bool

Whether to collect summary features

required

Returns:

Type Description
Dict[str, float]

Dict mapping feature names to values
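A sketch of the extraction, assuming the features appear under the hit's "fields" entry as "matchfeatures", "rankfeatures" and "summaryfeatures" (the usual layout of a Vespa JSON hit). Skipping non-numeric values and merging all three groups without a prefix are assumptions; `features_from_hit` is not the library function.

```python
from typing import Any, Dict


def features_from_hit(
    hit: Dict[str, Any],
    collect_matchfeatures: bool = True,
    collect_rankfeatures: bool = False,
    collect_summaryfeatures: bool = False,
) -> Dict[str, float]:
    """Merge the enabled feature groups of a hit into one flat dict."""
    fields = hit.get("fields", {})
    wanted = [
        ("matchfeatures", collect_matchfeatures),
        ("rankfeatures", collect_rankfeatures),
        ("summaryfeatures", collect_summaryfeatures),
    ]
    features: Dict[str, float] = {}
    for key, enabled in wanted:
        if not enabled:
            continue
        for name, value in (fields.get(key) or {}).items():
            # Assumption: only scalar features are kept; tensors are skipped.
            if isinstance(value, (int, float)):
                features[name] = float(value)
    return features


hit = {
    "id": "id:ns:doc::d12",
    "relevance": 0.7,
    "fields": {
        "matchfeatures": {"bm25(title)": 3.2, "nativeRank": 0.1},
        "summaryfeatures": {"fieldMatch(title)": 0.5},
    },
}
print(features_from_hit(hit))
```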

__getattr__(name)

Lazy import for optional MTEB dependencies.