
Evaluation

vespa.evaluation

VespaEvaluator(queries, relevant_docs, vespa_query_fn, app, name='', id_field='', accuracy_at_k=[1, 3, 5, 10], precision_recall_at_k=[1, 3, 5, 10], mrr_at_k=[10], ndcg_at_k=[10], map_at_k=[100], write_csv=False, csv_dir=None)

Evaluate retrieval performance on a Vespa application.

This class:

  • Iterates over queries and issues them against your Vespa application.
  • Retrieves top-k documents per query (with k = max of your IR metrics).
  • Compares the retrieved documents with a set of relevant document ids.
  • Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k.
  • Logs vespa search times for each query.
  • Logs/returns these metrics.
  • Optionally writes out to CSV.
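
The per-query values follow the standard IR definitions of these metrics and are then aggregated over all queries. A minimal sketch of Precision@k, Recall@k and MRR@k for a single query (illustrative only, not the library's exact implementation):

def precision_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    # Fraction of the top-k results that are relevant.
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def recall_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    # Fraction of all relevant documents found in the top-k results.
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / len(relevant_ids)

def reciprocal_rank_at_k(ranked_ids: list, relevant_ids: set, k: int) -> float:
    # 1/rank of the first relevant document within the top-k, else 0.
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

ranked = ["d7", "d99", "d42"]   # doc ids returned by Vespa, best first
relevant = {"d12", "d99"}       # judged relevant doc ids for this query
print(precision_at_k(ranked, relevant, 3))        # 0.333...
print(recall_at_k(ranked, relevant, 3))           # 0.5
print(reciprocal_rank_at_k(ranked, relevant, 3))  # 0.5
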
Example usage
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
# Or, relevant_docs can be a dict of query_id => map of doc_id => relevance
# relevant_docs = {
#     "q1": {"d12": 1, "d99": 0.1},
#     "q2": {"d101": 0.01},
#     # ...
# }
# Note that for non-binary relevance, the relevance values should be in [0, 1],
# and that only the nDCG metric will be computed.

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": 'select * from sources * where userInput("' + query_text + '");',
        "hits": top_k,
        "ranking": "your_ranking_profile",
    }

app = Vespa(url="http://localhost", port=8080)

evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    name="test-run",
    accuracy_at_k=[1, 3, 5],
    precision_recall_at_k=[1, 3, 5],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=True
)

results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| queries | dict | A dictionary of query_id => query text. | required |
| relevant_docs | dict | A dictionary of query_id => set of relevant doc_ids, or query_id => dict of doc_id => relevance. See the example usage above. | required |
| vespa_query_fn | callable | A callable with the signature my_func(query: str, top_k: int) -> dict. Given a query string and top_k, returns a Vespa query body (dict). | required |
| app | Vespa | A vespa.application.Vespa instance. | required |
| name | str | A name or tag for this evaluation run. | '' |
| id_field | str | The field name in Vespa that contains the document ID. If unset, will try to use the Vespa internal document ID, but this may fail in some cases (see https://docs.vespa.ai/en/documents.html#docid-in-results). | '' |
| accuracy_at_k | list of int | List of k-values for Accuracy@k. | [1, 3, 5, 10] |
| precision_recall_at_k | list of int | List of k-values for Precision@k and Recall@k. | [1, 3, 5, 10] |
| mrr_at_k | list of int | List of k-values for MRR@k. | [10] |
| ndcg_at_k | list of int | List of k-values for NDCG@k. | [10] |
| map_at_k | list of int | List of k-values for MAP@k. | [100] |
| write_csv | bool | If True, writes results to CSV. | False |
| csv_dir | str | Path to write the CSV file to (defaults to the current working directory). | None |
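
The vespa_query_fn in the example above splices the raw query text into the YQL string. A sketch of an alternative that instead passes the text through Vespa's query request parameter and references it with userQuery() in the YQL; the function name and ranking profile name are placeholders:

def vespa_query_fn_with_user_query(query_text: str, top_k: int) -> dict:
    # Pass the raw query text via the "query" request parameter and let
    # userQuery() in the YQL parse it, instead of concatenating it into the YQL.
    return {
        "yql": "select * from sources * where userQuery()",
        "query": query_text,
        "hits": top_k,
        "ranking": "your_ranking_profile",  # placeholder profile name
    }
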

filter_queries(queries, relevant_docs)

Filter out queries that have no relevant docs.
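
A minimal sketch of the filtering this implies (illustrative, not necessarily the exact implementation):

queries = {"q1": "best gpu", "q2": "sourdough", "q3": "unjudged query"}
relevant_docs = {"q1": {"d12"}, "q2": {"d101"}, "q3": set()}

# Keep only queries that have at least one relevance judgment.
filtered_queries = {
    query_id: text for query_id, text in queries.items() if relevant_docs.get(query_id)
}
# filtered_queries == {"q1": "best gpu", "q2": "sourdough"}
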

run()

Executes the evaluation by running queries and computing IR metrics.

This method:

  1. Executes all configured queries against the Vespa application.
  2. Collects search results and timing information.
  3. Computes the configured IR metrics (Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k).
  4. Records search timing statistics.
  5. Logs results and optionally writes them to CSV.

Returns:

| Name | Type | Description |
|------|------|-------------|
| dict | Dict[str, float] | A dictionary containing IR metrics with names like "accuracy@k", "precision@k", etc., and search time statistics ("searchtime_avg", "searchtime_q50", etc.). Metric values are floats between 0 and 1; timing values are in seconds. |

Example
{
    "accuracy@1": 0.75,
    "ndcg@10": 0.68,
    "searchtime_avg": 0.0123,
    ...
}
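
A short sketch of consuming the returned dictionary, assuming the results and evaluator objects from the example usage above:

# Separate IR metrics from search-time statistics by key prefix.
ir_metrics = {k: v for k, v in results.items() if not k.startswith("searchtime")}
timings = {k: v for k, v in results.items() if k.startswith("searchtime")}

for metric_name, value in sorted(ir_metrics.items()):
    print(f"{metric_name}: {value:.4f}")
print(f"Average search time: {timings.get('searchtime_avg', 0.0):.4f} s")
print("Primary metric:", evaluator.primary_metric)
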

mean(values)

Compute the mean of a list of numbers without using numpy.

percentile(values, p)

Compute the p-th percentile of a list of values (0 <= p <= 100). This approximates numpy.percentile's behavior.
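
A minimal sketch of such a numpy-free percentile using linear interpolation between the two nearest ranks (the same default interpolation numpy.percentile uses); illustrative, not necessarily the exact implementation:

def percentile(values: list, p: float) -> float:
    # p-th percentile (0 <= p <= 100) with linear interpolation between ranks.
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

print(percentile([0.012, 0.034, 0.021, 0.015], 50))  # 0.018
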