# Evaluation

## vespa.evaluation

### VespaEvaluator

```python
VespaEvaluator(
    queries,
    relevant_docs,
    vespa_query_fn,
    app,
    name="",
    id_field="",
    accuracy_at_k=[1, 3, 5, 10],
    precision_recall_at_k=[1, 3, 5, 10],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=False,
    csv_dir=None,
)
```
Evaluate retrieval performance on a Vespa application.
This class:
- Iterates over queries and issues them against your Vespa application.
- Retrieves top-k documents per query (with k = max of your IR metrics).
- Compares the retrieved documents against the set of relevant document IDs.
- Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k.
- Logs Vespa search times for each query.
- Logs/returns these metrics.
- Optionally writes out to CSV.
Example usage:

```python
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id:
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
# Or, relevant_docs can be a dict of query_id => map of doc_id => relevance:
# relevant_docs = {
#     "q1": {"d12": 1, "d99": 0.1},
#     "q2": {"d101": 0.01},
#     # ...
# }
# Note that for non-binary relevance, the relevance values should be in [0, 1],
# and that only the nDCG metric will be computed.

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": f'select * from sources * where userInput("{query_text}")',
        "hits": top_k,
        "ranking": "your_ranking_profile",
    }

app = Vespa(url="http://localhost", port=8080)

evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    name="test-run",
    accuracy_at_k=[1, 3, 5],
    precision_recall_at_k=[1, 3, 5],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=True,
)

results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `dict` | A dictionary of query_id => query text. | required |
| `relevant_docs` | `dict` | A dictionary of query_id => set of relevant doc_ids, or query_id => dict of doc_id => relevance. See example usage. | required |
| `vespa_query_fn` | `callable` | A callable with the signature `fn(query_text: str, top_k: int) -> dict`, returning a Vespa query body (see example usage). | required |
| `app` | `Vespa` | A `vespa.application.Vespa` instance. | required |
| `name` | `str` | A name or tag for this evaluation run. | `''` |
| `id_field` | `str` | The field name in Vespa that contains the document ID. If unset, will try to use the Vespa internal document ID, but this may fail in some cases (see https://docs.vespa.ai/en/documents.html#docid-in-results). | `''` |
| `accuracy_at_k` | `list of int` | List of k-values for Accuracy@k. | `[1, 3, 5, 10]` |
| `precision_recall_at_k` | `list of int` | List of k-values for Precision@k and Recall@k. | `[1, 3, 5, 10]` |
| `mrr_at_k` | `list of int` | List of k-values for MRR@k. | `[10]` |
| `ndcg_at_k` | `list of int` | List of k-values for NDCG@k. | `[10]` |
| `map_at_k` | `list of int` | List of k-values for MAP@k. | `[100]` |
| `write_csv` | `bool` | If True, writes results to CSV. | `False` |
| `csv_dir` | `str` | Path to write the CSV file (default is the current working directory). | `None` |
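If your schema stores the document ID in a regular field, pointing `id_field` at it makes matching independent of the internal Vespa document ID. A minimal sketch, reusing the objects from the example above; the field name `doc_id` is a hypothetical placeholder for whatever field your schema uses:

```python
# Sketch: match retrieved hits on a schema field instead of the internal docid.
# "doc_id" is a hypothetical field name; substitute the field that holds
# identifiers like "d12", "d99" in your schema.
evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    id_field="doc_id",
)
```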
### filter_queries(queries, relevant_docs)

Filter out queries that have no relevant docs.
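The filtering behavior is equivalent to the following sketch (an illustration of the described behavior, not the library's internal code):

```python
# Keep only queries that have at least one relevant document.
# relevant_docs values may be a set, a single doc_id string, or a
# doc_id => relevance dict; all are truthy when non-empty.
def filter_queries_sketch(queries: dict, relevant_docs: dict) -> dict:
    return {
        qid: text
        for qid, text in queries.items()
        if relevant_docs.get(qid)
    }
```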
### run()

Executes the evaluation by running queries and computing IR metrics.

This method:

1. Executes all configured queries against the Vespa application.
2. Collects search results and timing information.
3. Computes the configured IR metrics (Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k).
4. Records search timing statistics.
5. Logs results and optionally writes them to CSV.
Returns:

| Name | Type | Description |
|---|---|---|
| dict | `Dict[str, float]` | A dictionary containing IR metrics with names like "accuracy@k", "precision@k", etc., and search time statistics ("searchtime_avg", "searchtime_q50", etc.). Metric values are floats between 0 and 1; timing values are in seconds. |
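For example, with `ndcg_at_k=[10]` the returned dict can be read as below. The key spellings follow the "metric@k" pattern described above; verify the exact names against your logs or the returned dict itself:

```python
results = evaluator.run()  # equivalent to calling evaluator()

# Assumed key spellings, following the "metric@k" pattern described above.
print(results["ndcg@10"])         # float in [0, 1]
print(results["searchtime_avg"])  # seconds
```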
### mean(values)

Compute the mean of a list of numbers without using numpy.

### percentile(values, p)

Compute the p-th percentile of a list of values (0 <= p <= 100). This approximates numpy.percentile's behavior.
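A sketch of what such helpers typically look like, using linear interpolation between closest ranks as numpy.percentile does by default (an illustration of the described behavior, not the library's exact code):

```python
def mean(values: list) -> float:
    # Arithmetic mean without numpy; 0.0 for an empty list is an assumption.
    return sum(values) / len(values) if values else 0.0

def percentile(values: list, p: float) -> float:
    # p-th percentile (0 <= p <= 100) with linear interpolation,
    # approximating numpy.percentile's default method.
    if not values:
        return 0.0
    s = sorted(values)
    rank = (len(s) - 1) * (p / 100.0)  # fractional rank into sorted list
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    frac = rank - lo
    return s[lo] + (s[hi] - s[lo]) * frac
```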