# Evaluation

## vespa.evaluation

### VespaEvaluator

```python
VespaEvaluator(
    queries,
    relevant_docs,
    vespa_query_fn,
    app,
    name="",
    id_field="",
    accuracy_at_k=[1, 3, 5, 10],
    precision_recall_at_k=[1, 3, 5, 10],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=False,
    csv_dir=None,
)
```
Evaluate retrieval performance on a Vespa application.
This class:
- Iterates over queries and issues them against your Vespa application.
- Retrieves top-k documents per query (with k = max of your IR metrics).
- Compares the retrieved documents against the set of relevant document IDs.
- Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k.
- Logs Vespa search times for each query.
- Logs/returns these metrics.
- Optionally writes out to CSV.
Example usage:

```python
from vespa.application import Vespa
from vespa.evaluation import VespaEvaluator

queries = {
    "q1": "What is the best GPU for gaming?",
    "q2": "How to bake sourdough bread?",
    # ...
}
relevant_docs = {
    "q1": {"d12", "d99"},
    "q2": {"d101"},
    # ...
}
# relevant_docs can also be a dict of query_id => single relevant doc_id:
# relevant_docs = {
#     "q1": "d12",
#     "q2": "d101",
#     # ...
# }
# Or, relevant_docs can be a dict of query_id => map of doc_id => relevance:
# relevant_docs = {
#     "q1": {"d12": 1, "d99": 0.1},
#     "q2": {"d101": 0.01},
#     # ...
# }
# Note that for non-binary relevance, the relevance values should be in [0, 1],
# and that only the nDCG metric will be computed.

def my_vespa_query_fn(query_text: str, top_k: int) -> dict:
    return {
        "yql": f'select * from sources * where userInput("{query_text}")',
        "hits": top_k,
        "ranking": "your_ranking_profile",
    }

app = Vespa(url="http://localhost", port=8080)

evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    name="test-run",
    accuracy_at_k=[1, 3, 5],
    precision_recall_at_k=[1, 3, 5],
    mrr_at_k=[10],
    ndcg_at_k=[10],
    map_at_k=[100],
    write_csv=True,
)

results = evaluator()
print("Primary metric:", evaluator.primary_metric)
print("All results:", results)
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `queries` | `dict` | A dictionary of query_id => query text. | required |
| `relevant_docs` | `dict` | A dictionary of query_id => set of relevant doc_ids, or query_id => dict of doc_id => relevance. See example usage. | required |
| `vespa_query_fn` | `callable` | A callable with the signature `fn(query_text: str, top_k: int) -> dict`, returning a Vespa query body (see example usage). | required |
| `app` | `Vespa` | A `vespa.application.Vespa` instance. | required |
| `name` | `str` | A name or tag for this evaluation run. | `''` |
| `id_field` | `str` | The field name in Vespa that contains the document ID. If unset, will try to use the Vespa internal document ID, but this may fail in some cases (see https://docs.vespa.ai/en/documents.html#docid-in-results). | `''` |
| `accuracy_at_k` | `list of int` | List of k-values for Accuracy@k. | `[1, 3, 5, 10]` |
| `precision_recall_at_k` | `list of int` | List of k-values for Precision@k and Recall@k. | `[1, 3, 5, 10]` |
| `mrr_at_k` | `list of int` | List of k-values for MRR@k. | `[10]` |
| `ndcg_at_k` | `list of int` | List of k-values for NDCG@k. | `[10]` |
| `map_at_k` | `list of int` | List of k-values for MAP@k. | `[100]` |
| `write_csv` | `bool` | If True, writes results to CSV. | `False` |
| `csv_dir` | `str` | Path to write the CSV file (default is the current working directory). | `None` |
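If your schema stores the document ID in a regular field, pointing `id_field` at it makes matching independent of the internal Vespa document ID. A minimal sketch, reusing the objects from the example above; the field name `doc_id` is a hypothetical placeholder for whatever field your schema uses:

```python
# Sketch: match retrieved hits on a schema field instead of the internal docid.
# "doc_id" is a hypothetical field name; substitute the field that holds
# identifiers like "d12", "d99" in your schema.
evaluator = VespaEvaluator(
    queries=queries,
    relevant_docs=relevant_docs,
    vespa_query_fn=my_vespa_query_fn,
    app=app,
    id_field="doc_id",
)
```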
### filter_queries(queries, relevant_docs)

Filter out queries that have no relevant docs.
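The filtering behavior is equivalent to the following sketch (an illustration of the described behavior, not the library's internal code):

```python
# Keep only queries that have at least one relevant document.
# relevant_docs values may be a set, a single doc_id string, or a
# doc_id => relevance dict; all are truthy when non-empty.
def filter_queries_sketch(queries: dict, relevant_docs: dict) -> dict:
    return {
        qid: text
        for qid, text in queries.items()
        if relevant_docs.get(qid)
    }
```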
### run()

Executes the evaluation by running queries and computing IR metrics.

This method:

1. Executes all configured queries against the Vespa application.
2. Collects search results and timing information.
3. Computes the configured IR metrics (Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k).
4. Records search timing statistics.
5. Logs results and optionally writes them to CSV.
Returns:

| Name | Type | Description |
|---|---|---|
| dict | `Dict[str, float]` | A dictionary containing IR metrics with names like "accuracy@k", "precision@k", etc., and search time statistics ("searchtime_avg", "searchtime_q50", etc.). Metric values are floats between 0 and 1; timing values are in seconds. |
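For example, with `ndcg_at_k=[10]` the returned dict can be read as below. The key spellings follow the "metric@k" pattern described above; verify the exact names against your logs or the returned dict itself:

```python
results = evaluator.run()  # equivalent to calling evaluator()

# Assumed key spellings, following the "metric@k" pattern described above.
print(results["ndcg@10"])         # float in [0, 1]
print(results["searchtime_avg"])  # seconds
```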
### mean(values)

Compute the mean of a list of numbers without using numpy.

### percentile(values, p)

Compute the p-th percentile of a list of values (0 <= p <= 100). This approximates numpy.percentile's behavior.
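A sketch of what such helpers typically look like, using linear interpolation between closest ranks as numpy.percentile does by default (an illustration of the described behavior, not the library's exact code):

```python
def mean(values: list) -> float:
    # Arithmetic mean without numpy; 0.0 for an empty list is an assumption.
    return sum(values) / len(values) if values else 0.0

def percentile(values: list, p: float) -> float:
    # p-th percentile (0 <= p <= 100) with linear interpolation,
    # approximating numpy.percentile's default method.
    if not values:
        return 0.0
    s = sorted(values)
    rank = (len(s) - 1) * (p / 100.0)  # fractional rank into sorted list
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    frac = rank - lo
    return s[lo] + (s[hi] - s[lo]) * frac
```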