Evaluating a Vespa Application¶
We are often asked by users and customers what the best retrieval and ranking strategy is for a given use case. And even though we might sometimes have an intuition, we always recommend setting up experiments and doing a proper quantitative evaluation.
Models are temporary; Evals are forever.
-Eugene Yan
Without a proper evaluation setup, you run the risk of settling for lgtm@10 (looks good to me @ 10).
Then, if you deploy your application to users, you can be sure to get feedback about queries that do not produce relevant results. If you try to optimize for those without knowing whether your tweaks actually improve the overall quality of your search, you might end up with a system that is worse than the one you started with.
So, what can you do?¶
You can set up a proper evaluation pipeline, where you can test different ranking strategies and see how they perform on a set of evaluation queries that act as a proxy for your real users' queries. This way, you can make informed decisions about what works best for your use case. Using real user interactions, if you collect them, is even better, but it is important to keep the evaluation pipeline light enough that you can run it both during development and in CI pipelines (possibly at different scales).
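For example, a lightweight check like the sketch below can act as a CI gate so that quality regressions fail the build. This is only a sketch: it assumes an evaluator built with the VespaEvaluator class introduced later in this guide, and 0.60 is an arbitrary baseline threshold you would choose yourself.

```python
# Hedged sketch of a CI gate; `evaluator` is a VespaEvaluator instance
# (constructed as shown later in this guide) and 0.60 is an assumed baseline.
def test_ranking_quality():
    results = evaluator()
    assert results["ndcg@10"] >= 0.60, f"nDCG@10 regressed: {results['ndcg@10']:.3f}"
```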
This guide shows how you can easily evaluate a Vespa application using pyvespa's VespaMatchEvaluator and VespaEvaluator classes.
Evaluate match-phase (retrieval) for recall¶
The match-phase (or retrieval phase) in Vespa is performed to retrieve candidate documents to rank. Here, what we care about is that all potentially relevant documents are retrieved fast, without matching too many documents. If we match too many documents, latency will suffer, as all retrieved docs are exposed to ranking. For an introduction to phased retrieval and ranking in Vespa, see the docs.
For this tutorial, we will evaluate and compare weakAnd, nearestNeighbor, as well as the combination of the two (using the OR operator).
Evaluate ranking¶
We will define and compare 4 different ranking strategies in this guide:

- bm25 - Keyword-based retrieval and ranking. The solid baseline.
- semantic - Vector search using cosine similarity (using https://huggingface.co/intfloat/e5-small-v2 for embeddings).
- fusion - Hybrid search (semantic + keyword), combining BM25 and semantic with reciprocal rank fusion.
- atan_norm - Hybrid search, combining BM25 and semantic with atan normalization as described in Aapo Tanskanen's Guidebook to the State-of-the-Art Embeddings and Information Retrieval (originally proposed by Seo et al. (2022)). Both fusion schemes are illustrated in the toy sketch below.
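To make the two hybrid strategies concrete, here is a toy sketch (scores and ranks are made up for illustration) of how reciprocal rank fusion combines rank positions and how atan normalization squashes an unbounded score before the two signals are added:

```python
import math


# Reciprocal rank fusion: combine by rank position only (k=60 is a common default).
def rrf(rank_bm25: int, rank_semantic: int, k: int = 60) -> float:
    return 1.0 / (k + rank_bm25) + 1.0 / (k + rank_semantic)


# Atan normalization: squash an unbounded score (e.g. BM25) into [0, 1) so it can be
# added to a cosine similarity on a comparable scale.
def atan_scale(score: float) -> float:
    return 2.0 * math.atan(score) / math.pi


print(rrf(rank_bm25=1, rank_semantic=3))  # toy example: doc ranked 1st and 3rd
print(atan_scale(12.4))  # a made-up BM25 score, close to 1.0 after scaling
```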
Pre-requisite: Create a tenant at cloud.vespa.ai, save the tenant name.
Install¶
Install pyvespa >= 0.53.0 and the Vespa CLI. The Vespa CLI is used for data and control plane key management (Vespa Cloud Security Guide).
!pip3 install pyvespa vespacli datasets pandas
Configure application¶
# Replace with your tenant name from the Vespa Cloud Console
tenant_name = "vespa-team"
# Replace with your application name (does not need to exist yet)
application = "evaluation"
schema_name = "doc"
Create an application package¶
The application package has all the Vespa configuration files - create one from scratch:
from vespa.package import (
ApplicationPackage,
Field,
Schema,
Document,
HNSW,
RankProfile,
Component,
Parameter,
FieldSet,
GlobalPhaseRanking,
Function,
)
import pandas as pd
package = ApplicationPackage(
name=application,
schema=[
Schema(
name=schema_name,
document=Document(
fields=[
# Note that we need an id field as attribute to be able to do evaluation
# Vespa internal query document id is used as fallback, but have some limitations, see https://docs.vespa.ai/en/document-v1-api-guide.html#query-result-id
Field(name="id", type="string", indexing=["summary", "attribute"]),
Field(
name="text",
type="string",
indexing=["index", "summary"],
index="enable-bm25",
bolding=True,
),
Field(
name="embedding",
type="tensor<float>(x[384])",
indexing=[
"input text",
"embed", # uses default model
"index",
"attribute",
],
ann=HNSW(distance_metric="angular"),
is_document_field=False,
),
]
),
fieldsets=[FieldSet(name="default", fields=["text"])],
rank_profiles=[
RankProfile(
name="match-only",
inputs=[("query(q)", "tensor<float>(x[384])")],
first_phase="random", # TODO: Remove when pyvespa supports empty first_phase
),
RankProfile(
name="bm25",
inputs=[("query(q)", "tensor<float>(x[384])")],
functions=[Function(name="bm25text", expression="bm25(text)")],
first_phase="bm25text",
match_features=["bm25text"],
),
RankProfile(
name="semantic",
inputs=[("query(q)", "tensor<float>(x[384])")],
functions=[
Function(
name="cos_sim", expression="closeness(field, embedding)"
)
],
first_phase="cos_sim",
match_features=["cos_sim"],
),
RankProfile(
name="fusion",
inherits="bm25",
functions=[
Function(
name="cos_sim", expression="closeness(field, embedding)"
)
],
inputs=[("query(q)", "tensor<float>(x[384])")],
first_phase="cos_sim",
global_phase=GlobalPhaseRanking(
expression="reciprocal_rank_fusion(bm25text, closeness(field, embedding))",
rerank_count=1000,
),
match_features=["cos_sim", "bm25text"],
),
RankProfile(
name="atan_norm",
inherits="bm25",
inputs=[("query(q)", "tensor<float>(x[384])")],
functions=[
Function(
name="scale",
args=["val"],
expression="2*atan(val)/(3.14159)",
),
Function(
name="normalized_bm25", expression="scale(bm25(text))"
),
Function(
name="cos_sim", expression="closeness(field, embedding)"
),
],
first_phase="normalized_bm25",
global_phase=GlobalPhaseRanking(
expression="normalize_linear(normalized_bm25) + normalize_linear(cos_sim)",
rerank_count=1000,
),
match_features=["cos_sim", "normalized_bm25"],
),
],
)
],
components=[
Component(
id="e5",
type="hugging-face-embedder",
parameters=[
Parameter(
"transformer-model",
{
"model-id": "e5-small-v2"
}, # in vespa cloud, we can use the model-id for selected models, see https://cloud.vespa.ai/en/model-hub
),
Parameter(
"tokenizer-model",
{"model-id": "e5-base-v2-vocab"},
),
],
)
],
)
Note that the application name cannot contain - or _.
Deploy to Vespa Cloud¶
The app is now defined and ready to deploy to Vespa Cloud. Deploy the package to Vespa Cloud by creating an instance of VespaCloud:
from vespa.deployment import VespaCloud
import os
# Key is only used for CI/CD. Can be removed if logging in interactively
vespa_cloud = VespaCloud(
tenant=tenant_name,
application=application,
key_content=os.getenv(
"VESPA_TEAM_API_KEY", None
), # Key is only used for CI/CD. Can be removed if logging in interactively
application_package=package,
)
Setting application... Running: vespa config set application vespa-team.evaluation.default Setting target cloud... Running: vespa config set target cloud Api-key found for control plane access. Using api-key.
For more details on different authentication options and methods, see authenticating-to-vespa-cloud.
The following will upload the application package to the Vespa Cloud Dev Zone (aws-us-east-1c); read more about Vespa Zones.
The Vespa Cloud Dev Zone is a sandbox environment where resources are down-scaled and idle deployments expire automatically.
For information about production deployments, see the following method.
Note: Deployments to dev and perf expire after 7 days of inactivity, i.e., 7 days after running deploy. This applies to all plans, not only the Free Trial. Use the Vespa Console to extend the expiry period, or redeploy the application to add 7 more days.
Now deploy the app to Vespa Cloud dev zone.
The first deployment typically takes 2 minutes until the endpoint is up. (Applications that, for example, refer to large ONNX models may take a bit longer.)
from vespa.application import Vespa
app: Vespa = vespa_cloud.deploy()
Deployment started in run 4 of dev-aws-us-east-1c for vespa-team.evaluation. This may take a few minutes the first time. INFO [11:21:29] Deploying platform version 8.530.11 and application dev build 4 for dev-aws-us-east-1c of default ... INFO [11:21:29] Using CA signed certificate version 5 INFO [11:21:29] Using 1 nodes in container cluster 'evaluation_container' INFO [11:21:31] Using 1 nodes in container cluster 'evaluation_container' INFO [11:21:34] Session 361140 for tenant 'vespa-team' prepared and activated. INFO [11:21:34] ######## Details for all nodes ######## INFO [11:21:34] h118215a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [11:21:34] --- platform vespa/cloud-tenant-rhel8:8.530.11 INFO [11:21:34] --- container-clustercontroller on port 19050 has config generation 361121, wanted is 361140 INFO [11:21:34] --- metricsproxy-container on port 19092 has config generation 361140, wanted is 361140 INFO [11:21:34] h119458a.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [11:21:34] --- platform vespa/cloud-tenant-rhel8:8.530.11 INFO [11:21:34] --- container on port 4080 has config generation 361140, wanted is 361140 INFO [11:21:34] --- metricsproxy-container on port 19092 has config generation 361140, wanted is 361140 INFO [11:21:34] h97530b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [11:21:34] --- platform vespa/cloud-tenant-rhel8:8.530.11 INFO [11:21:34] --- logserver-container on port 4080 has config generation 361140, wanted is 361140 INFO [11:21:34] --- metricsproxy-container on port 19092 has config generation 361121, wanted is 361140 INFO [11:21:34] h112930b.dev.us-east-1c.aws.vespa-cloud.net: expected to be UP INFO [11:21:34] --- platform vespa/cloud-tenant-rhel8:8.530.11 INFO [11:21:34] --- storagenode on port 19102 has config generation 361121, wanted is 361140 INFO [11:21:34] --- searchnode on port 19107 has config generation 361140, wanted is 361140 INFO [11:21:34] --- distributor on port 19111 has config generation 361140, wanted is 361140 INFO [11:21:34] --- metricsproxy-container on port 19092 has config generation 361140, wanted is 361140 INFO [11:21:49] Found endpoints: INFO [11:21:49] - dev.aws-us-east-1c INFO [11:21:49] |-- https://c34ea535.ccc9bd09.z.vespa-app.cloud/ (cluster 'evaluation_container') INFO [11:21:49] Deployment of new application complete! Only region: aws-us-east-1c available in dev environment. Found mtls endpoint for evaluation_container URL: https://c34ea535.ccc9bd09.z.vespa-app.cloud/ Application is up!
If the deployment failed, it is possible you forgot to add the key in the Vespa Cloud Console in the vespa auth api-key step above.
If you can authenticate, you should see lines like the following:
Deployment started in run 1 of dev-aws-us-east-1c for mytenant.hybridsearch.
The deployment takes a few minutes the first time while Vespa Cloud sets up the resources for your Vespa application.
app now holds a reference to a Vespa instance. We can access the mTLS-protected endpoint name using the control-plane (vespa_cloud) instance, and query and feed to this endpoint (data plane access) using the mTLS certificate generated in previous steps.
See Authenticating to Vespa Cloud for details on using token authentication instead of mTLS.
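As a quick sanity check (the query below is a trivial placeholder, and the corpus has not been fed yet), we can verify that the data plane endpoint responds:

```python
# Issue a trivial query against the endpoint to confirm data plane access works.
response = app.query(yql=f"select * from {schema_name} where true", hits=1)
print(response.is_successful())
```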
Getting your evaluation data¶
For evaluating information retrieval methods, in addition to the document corpus, we also need a set of queries and a mapping from queries to relevant documents.
For this guide, we will use the NanoMSMARCO dataset, made available on huggingface by Zeta Alpha.
This dataset is a subset of their 🍺NanoBEIR-collection, with 50 queries and up to 10K documents each.
This is a great dataset for testing and evaluating information retrieval methods quickly, as it is small and easy to work with.
Note that for almost any real-world use case, we recommend creating your own evaluation dataset. See the Vespa blog post on how an LLM can help with this.
Creating 20-50 queries and annotating relevant documents for each query is a good start, and well worth the effort.
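If you do build your own dataset, all the evaluators need is a dict of queries and a dict of relevant documents. A minimal sketch of loading such annotations from CSV (the file names and column names below are hypothetical):

```python
import csv

my_queries = {}  # query_id -> query text
my_relevant_docs = {}  # query_id -> set of relevant doc ids

# Hypothetical files: my_queries.csv with columns query_id,query
# and my_qrels.csv with columns query_id,doc_id.
with open("my_queries.csv") as f:
    for row in csv.DictReader(f):
        my_queries[row["query_id"]] = row["query"]

with open("my_qrels.csv") as f:
    for row in csv.DictReader(f):
        my_relevant_docs.setdefault(row["query_id"], set()).add(row["doc_id"])
```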
from datasets import load_dataset
dataset_id = "zeta-alpha-ai/NanoMSMARCO"
dataset = load_dataset(dataset_id, "corpus", split="train", streaming=True)
vespa_feed = dataset.map(
lambda x: {
"id": x["_id"],
"fields": {"text": x["text"], "id": x["_id"]},
}
)
Since we are only evaluating ranking strategies here, we consider it OK to use the train split for evaluation. If we were to make changes to our ranking strategies, such as adding weighting terms or training ML models for ranking, we would suggest adopting a train, validation, and test split approach to avoid overfitting.
query_ds = load_dataset(dataset_id, "queries", split="train")
qrels = load_dataset(dataset_id, "qrels", split="train")
ids_to_query = dict(zip(query_ds["_id"], query_ds["text"]))
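Had we needed the train/validation/test split discussed above, a simple random split of the query ids could look like the sketch below (the proportions are arbitrary):

```python
import random

# Hedged sketch: randomly hold out queries for validation and test.
query_ids = list(ids_to_query)
random.seed(42)
random.shuffle(query_ids)

n = len(query_ids)
train_ids = query_ids[: int(0.6 * n)]
val_ids = query_ids[int(0.6 * n) : int(0.8 * n)]
test_ids = query_ids[int(0.8 * n) :]
```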
Let us print the first 5 queries:
for idx, (qid, q) in enumerate(ids_to_query.items()):
print(f"qid: {qid}, query: {q}")
if idx == 5:
break
qid: 994479, query: which health care system provides all citizens or residents with equal access to health care services qid: 1009388, query: what's right in health care qid: 1088332, query: weather in oran qid: 265729, query: how long keep financial records qid: 1099433, query: how do hoa fees work qid: 200600, query: heels or heal
relevant_docs = dict(zip(qrels["query-id"], qrels["corpus-id"]))
Let us print the first 5 query ids and their relevant documents:
for idx, (qid, doc_id) in enumerate(relevant_docs.items()):
print(f"qid: {qid}, doc_id: {doc_id}")
if idx == 5:
break
qid: 994479, doc_id: 7275120 qid: 1009388, doc_id: 7248824 qid: 1088332, doc_id: 7094398 qid: 265729, doc_id: 7369987 qid: 1099433, doc_id: 7255675 qid: 200600, doc_id: 7929603
We can see that this dataset only has one relevant document per query.
The VespaEvaluator
class handles this just fine, but you could also provide a set of relevant documents per query if there are multiple relevant docs.
# multiple relevant docs per query
qrels = {
"q1": {"doc1", "doc2"},
"q2": {"doc3", "doc4"},
# etc.
}
Now we can feed to Vespa using feed_iterable, which accepts any Iterable and an optional callback function where we can check the outcome of each operation. The application is configured to use embedding functionality that produces a vector embedding of the text input field. This step may be resource-intensive, depending on the model size.
Read more about embedding inference in Vespa in the Accelerating Transformer-based Embedding Retrieval with Vespa blog post.
Default node resources in the Vespa Cloud Dev Zone are 2 vCPUs.
from vespa.io import VespaResponse
def callback(response: VespaResponse, id: str):
if not response.is_successful():
print(f"Error when feeding document {id}: {response.get_json()}")
app.feed_iterable(vespa_feed, schema="doc", namespace="tutorial", callback=callback)
Evaluate match-phase¶
There are two separate classes provided for doing evaluations:
VespaMatchEvaluator
, which is intended to evaluate only the retrieval (or match-phase), and should not do any ranking. This is useful to evaluate whether your relevant documents are retrieved (and thus exposed to ranking). It only computes recall, total matched documents per query as well assearchtime
.VespaEvaluator
is intended to evaluate a complete ranking strategy, across several common IR metrics.
Both API's are inspired by SentenceTransformers InformationRetrievalEvaluator
class.
The difference it that VespaMatchEvaluator
evaluates only the retrieval phase, while VespaEvaluator
evaluates your whole retrieval and ranking system (Vespa application) as opposed to a single model. Your application should be fed with the document corpus in advance, instead of taking in the document corpus.
We have now created the app, the queries, and the relevant documents. The only thing missing before we can initialize the VespaMatchEvaluator is a set of functions that define the Vespa queries. Each of them is passed as vespa_query_fn.
We will use the vespa.querybuilder module to create the queries. See the reference doc and example notebook for more details on usage.
This module is a Python wrapper around the Vespa Query Language (YQL), and is an alternative to providing the YQL query as a string directly.
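As a small illustration, the builder object simply renders to the YQL string you could otherwise write by hand (the printed output below is approximate):

```python
import vespa.querybuilder as qb

# The builder object stringifies to plain YQL.
yql = str(qb.select("*").from_("doc").where(qb.userQuery("what is vespa")))
print(yql)  # roughly: select * from doc where userQuery()
```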
import vespa.querybuilder as qb
def match_weakand_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(qb.select("*").from_(schema_name).where(qb.userQuery(query_text))),
"query": query_text,
"ranking": "match-only",
"input.query(q)": f"embed({query_text})",
}
def match_hybrid_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 100},
)
| qb.userQuery(
query_text,
)
)
),
"query": query_text,
"ranking": "match-only",
"input.query(q)": f"embed({query_text})",
}
def match_semantic_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 100},
)
)
),
"query": query_text,
"ranking": "match-only",
"input.query(q)": f"embed({query_text})",
}
Now, let us run the evaluator:
from vespa.evaluation import VespaMatchEvaluator
match_results = {}
for evaluator_name, query_fn in [
("semantic", match_semantic_query_fn),
("weakand", match_weakand_query_fn),
("hybrid", match_hybrid_query_fn),
]:
print(f"Evaluating {evaluator_name}...")
match_evaluator = VespaMatchEvaluator(
queries=ids_to_query,
relevant_docs=relevant_docs,
vespa_query_fn=query_fn,
app=app,
name="test-run",
write_csv=True,
write_verbose=True, # optionally write verbose metrics to CSV
)
results = match_evaluator()
match_results[evaluator_name] = results
print(f"Results for {evaluator_name}:")
print(results)
Evaluating semantic... Results for semantic: {'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 100.0, 'total_queries': 50, 'searchtime_avg': 0.04836, 'searchtime_q50': 0.042, 'searchtime_q90': 0.0822, 'searchtime_q95': 0.09609999999999999} Evaluating weakand... Results for weakand: {'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 809.88, 'total_queries': 50, 'searchtime_avg': 0.03996999999999997, 'searchtime_q50': 0.036500000000000005, 'searchtime_q90': 0.060399999999999975, 'searchtime_q95': 0.07915} Evaluating hybrid... Results for hybrid: {'match_recall': 1.0, 'avg_recall_per_query': 1.0, 'total_relevant_docs': 50, 'total_matched_relevant': 50, 'avg_matched_per_query': 833.22, 'total_queries': 50, 'searchtime_avg': 0.035129999999999974, 'searchtime_q50': 0.037, 'searchtime_q90': 0.0501, 'searchtime_q95': 0.0521}
By setting write_csv=True and write_verbose=True, we can save CSV files with per-query details to further inspect the queries where relevant documents were not matched.
This is important for understanding how you could improve recall if some relevant documents were not matched.
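A sketch of how such an inspection could look (the glob pattern is an assumption; check the actual file names written in your run):

```python
import glob

import pandas as pd

# Load whatever verbose per-query CSVs the evaluator wrote and inspect them.
for path in glob.glob("*verbose*.csv"):
    per_query = pd.read_csv(path)
    print(path)
    print(per_query.head())
```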
results = pd.DataFrame(match_results)
results
|  | semantic | weakand | hybrid |
|---|---|---|---|
| match_recall | 1.00000 | 1.00000 | 1.00000 |
| avg_recall_per_query | 1.00000 | 1.00000 | 1.00000 |
| total_relevant_docs | 50.00000 | 50.00000 | 50.00000 |
| total_matched_relevant | 50.00000 | 50.00000 | 50.00000 |
| avg_matched_per_query | 100.00000 | 809.88000 | 833.22000 |
| total_queries | 50.00000 | 50.00000 | 50.00000 |
| searchtime_avg | 0.04836 | 0.03997 | 0.03513 |
| searchtime_q50 | 0.04200 | 0.03650 | 0.03700 |
| searchtime_q90 | 0.08220 | 0.06040 | 0.05010 |
| searchtime_q95 | 0.09610 | 0.07915 | 0.05210 |
Here, we can see that all retrieval strategies actually match all the relevant documents.
To tune the number of documents matched and the latency, we could tune the targetHits parameter for both the nearestNeighbor and weakAnd operators (our userQuery is converted to weakAnd, see the docs), as well as several additional weakAnd parameters. See the Vespa blog for details; we will not go into detail on this in this notebook.
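As a sketch of the kind of knob involved (10 below is an arbitrary value chosen for illustration; the rank profile is the same match-only profile used above), targetHits on the nearestNeighbor operator could be lowered to match fewer documents:

```python
def match_semantic_small_fn(query_text: str, top_k: int) -> dict:
    # Same as match_semantic_query_fn above, but with a lower (illustrative) targetHits.
    return {
        "yql": str(
            qb.select("*")
            .from_(schema_name)
            .where(
                qb.nearestNeighbor(
                    field="embedding",
                    query_vector="q",
                    annotations={"targetHits": 10},
                )
            )
        ),
        "query": query_text,
        "ranking": "match-only",
        "input.query(q)": f"embed({query_text})",
    }
```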
Evaluate ranking¶
Now, we will move on to demonstrate how to evaluate the ranking strategies. For that, we will use the VespaEvaluator class. Its interface is very similar to VespaMatchEvaluator, with the difference that it has many more metrics available. Also note that the number of hits will affect the number of documents considered for evaluation.
VespaEvaluator¶
Let us take a look at its API and documentation:
from vespa.evaluation import VespaEvaluator
?VespaEvaluator
Init signature: VespaEvaluator( queries: 'Dict[str, str]', relevant_docs: 'Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]', vespa_query_fn: 'Callable[[str, int, Optional[str]], dict]', app: 'Vespa', name: 'str' = '', id_field: 'str' = '', accuracy_at_k: 'List[int]' = [1, 3, 5, 10], precision_recall_at_k: 'List[int]' = [1, 3, 5, 10], mrr_at_k: 'List[int]' = [10], ndcg_at_k: 'List[int]' = [10], map_at_k: 'List[int]' = [100], write_csv: 'bool' = False, csv_dir: 'Optional[str]' = None, ) Docstring: Evaluate retrieval performance on a Vespa application. This class: - Iterates over queries and issues them against your Vespa application. - Retrieves top-k documents per query (with k = max of your IR metrics). - Compares the retrieved documents with a set of relevant document ids. - Computes IR metrics: Accuracy@k, Precision@k, Recall@k, MRR@k, NDCG@k, MAP@k. - Logs vespa search times for each query. - Logs/returns these metrics. - Optionally writes out to CSV. Note: The 'id_field' needs to be marked as an attribute in your Vespa schema, so filtering can be done on it. Example usage: ```python from vespa.application import Vespa from vespa.evaluation import VespaEvaluator queries = { "q1": "What is the best GPU for gaming?", "q2": "How to bake sourdough bread?", # ... } relevant_docs = { "q1": {"d12", "d99"}, "q2": {"d101"}, # ... } # relevant_docs can also be a dict of query_id => single relevant doc_id # relevant_docs = { # "q1": "d12", # "q2": "d101", # # ... # } # Or, relevant_docs can be a dict of query_id => map of doc_id => relevance # relevant_docs = { # "q1": {"d12": 1, "d99": 0.1}, # "q2": {"d101": 0.01}, # # ... # Note that for non-binary relevance, the relevance values should be in [0, 1], and that # only the nDCG metric will be computed. def my_vespa_query_fn(query_text: str, top_k: int) -> dict: return { "yql": 'select * from sources * where userInput("' + query_text + '");', "hits": top_k, "ranking": "your_ranking_profile", } app = Vespa(url="http://localhost", port=8080) evaluator = VespaEvaluator( queries=queries, relevant_docs=relevant_docs, vespa_query_fn=my_vespa_query_fn, app=app, name="test-run", accuracy_at_k=[1, 3, 5], precision_recall_at_k=[1, 3, 5], mrr_at_k=[10], ndcg_at_k=[10], map_at_k=[100], write_csv=True ) results = evaluator() print("Primary metric:", evaluator.primary_metric) print("All results:", results) ``` Args: queries (Dict[str, str]): A dictionary where keys are query IDs and values are query strings. relevant_docs (Union[Dict[str, Union[Set[str], Dict[str, float]]], Dict[str, str]]): A dictionary mapping query IDs to their relevant document IDs. Can be a set of doc IDs for binary relevance, a dict of doc_id to relevance score (float between 0 and 1) for graded relevance, or a single doc_id string. vespa_query_fn (Callable[[str, int, Optional[str]], dict]): A function that takes a query string, the number of hits to retrieve (top_k), and an optional query_id, and returns a Vespa query body dictionary. app (Vespa): An instance of the Vespa application. name (str, optional): A name for this evaluation run. Defaults to "". id_field (str, optional): The field name in the Vespa hit that contains the document ID. If empty, it tries to infer the ID from the 'id' field or 'fields.id'. Defaults to "". accuracy_at_k (List[int], optional): List of k values for which to compute Accuracy@k. Defaults to [1, 3, 5, 10]. precision_recall_at_k (List[int], optional): List of k values for which to compute Precision@k and Recall@k. 
Defaults to [1, 3, 5, 10]. mrr_at_k (List[int], optional): List of k values for which to compute MRR@k. Defaults to [10]. ndcg_at_k (List[int], optional): List of k values for which to compute NDCG@k. Defaults to [10]. map_at_k (List[int], optional): List of k values for which to compute MAP@k. Defaults to [100]. write_csv (bool, optional): Whether to write the evaluation results to a CSV file. Defaults to False. csv_dir (Optional[str], optional): Directory to save the CSV file. Defaults to None (current directory). File: ~/Repos/pyvespa/vespa/evaluation.py Type: type Subclasses:
def semantic_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 100},
)
)
),
"query": query_text,
"ranking": "semantic",
"input.query(q)": f"embed({query_text})",
"hits": top_k,
}
def bm25_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": "select * from sources * where userQuery();", # provide the yql directly as a string
"query": query_text,
"ranking": "bm25",
"hits": top_k,
}
def fusion_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 100},
)
| qb.userQuery(query_text)
)
),
"query": query_text,
"ranking": "fusion",
"input.query(q)": f"embed({query_text})",
"hits": top_k,
}
def atan_norm_query_fn(query_text: str, top_k: int) -> dict:
return {
"yql": str(
qb.select("*")
.from_(schema_name)
.where(
qb.nearestNeighbor(
field="embedding",
query_vector="q",
annotations={"targetHits": 100},
)
| qb.userQuery(query_text)
)
),
"query": query_text,
"ranking": "atan_norm",
"input.query(q)": f"embed({query_text})",
"hits": top_k,
}
all_results = {}
for evaluator_name, query_fn in [
("semantic", semantic_query_fn),
("bm25", bm25_query_fn),
("fusion", fusion_query_fn),
("atan_norm", atan_norm_query_fn),
]:
print(f"Evaluating {evaluator_name}...")
evaluator = VespaEvaluator(
queries=ids_to_query,
relevant_docs=relevant_docs,
vespa_query_fn=query_fn,
app=app,
name=evaluator_name,
write_csv=True, # optionally write metrics to CSV
)
results = evaluator.run()
all_results[evaluator_name] = results
Evaluating semantic... Evaluating bm25... Evaluating fusion... Evaluating atan_norm...
Looking at the results¶
results = pd.DataFrame(all_results)
# take out all rows with "searchtime" to a separate dataframe
searchtime = results[results.index.str.contains("searchtime")]
results = results[~results.index.str.contains("searchtime")]
# Highlight the maximum value in each row
def highlight_max(s):
is_max = s == s.max()
return ["background-color: lightgreen; color: black;" if v else "" for v in is_max]
# Style the DataFrame: Highlight max values and format numbers to 4 decimals
styled_df = results.style.apply(highlight_max, axis=1).format("{:.4f}")
styled_df
|  | semantic | bm25 | fusion | atan_norm |
|---|---|---|---|---|
| accuracy@1 | 0.3800 | 0.3000 | 0.4400 | 0.4400 |
| accuracy@3 | 0.6400 | 0.6000 | 0.6800 | 0.7000 |
| accuracy@5 | 0.7200 | 0.6600 | 0.7200 | 0.7400 |
| accuracy@10 | 0.8200 | 0.7400 | 0.8000 | 0.8000 |
| precision@1 | 0.3800 | 0.3000 | 0.4400 | 0.4400 |
| recall@1 | 0.3800 | 0.3000 | 0.4400 | 0.4400 |
| precision@3 | 0.2133 | 0.2000 | 0.2267 | 0.2333 |
| recall@3 | 0.6400 | 0.6000 | 0.6800 | 0.7000 |
| precision@5 | 0.1440 | 0.1320 | 0.1440 | 0.1480 |
| recall@5 | 0.7200 | 0.6600 | 0.7200 | 0.7400 |
| precision@10 | 0.0820 | 0.0740 | 0.0800 | 0.0800 |
| recall@10 | 0.8200 | 0.7400 | 0.8000 | 0.8000 |
| mrr@10 | 0.5309 | 0.4501 | 0.5529 | 0.5738 |
| ndcg@10 | 0.6007 | 0.5206 | 0.6126 | 0.6296 |
| map@100 | 0.5393 | 0.4594 | 0.5630 | 0.5838 |
We can see that for this particular dataset, the hybrid strategy atan_norm is the best across most metrics, with pure semantic edging slightly ahead at cutoff 10.
results.plot(kind="bar", figsize=(12, 6))
<Axes: >
Looking at searchtimes¶
Ranking quality is not the only thing that matters. For many applications, search time is equally important.
# plot search time, add (ms) to the y-axis
# convert to ms
searchtime = searchtime * 1000
searchtime.plot(kind="bar", figsize=(12, 6)).set(ylabel="time (ms)")
[Text(0, 0.5, 'time (ms)')]
We can see that both hybrid strategies, fusion and atan_norm, are a bit slower on average than pure bm25 or semantic, as expected.
Depending on the latency budget of your application, this is likely still an attractive trade-off.
Conclusion and next steps¶
We have shown how you can evaluate a Vespa application on two different levels:

- Evaluate retrieval (match-phase) using the VespaMatchEvaluator class. Here we checked recall, the number of matched documents per query, and searchtime.
- Evaluate ranking strategies using the VespaEvaluator class. Here we defined and compared 4 different ranking strategies in terms of both ranking quality and searchtime latency.
We hope this can provide you with a good starting point for evaluating your own Vespa application.
If you are ready to advance, you can try to optimize the ranking strategies further, for example by weighting each of the terms in the atan_norm strategy differently (a * normalize_linear(normalized_bm25) + (1-a) * normalize_linear(cos_sim)), or by adding a cross-encoder for re-ranking the top-k results.
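A sketch of what such a weighted variant could look like as an additional rank profile (the name atan_norm_weighted and the query(alpha) input are introduced here for illustration and are not part of the package deployed above):

```python
from vespa.package import Function, GlobalPhaseRanking, RankProfile

weighted_profile = RankProfile(
    name="atan_norm_weighted",
    inherits="bm25",
    inputs=[
        ("query(q)", "tensor<float>(x[384])"),
        ("query(alpha)", "double"),
    ],
    functions=[
        Function(name="scale", args=["val"], expression="2*atan(val)/(3.14159)"),
        Function(name="normalized_bm25", expression="scale(bm25(text))"),
        Function(name="cos_sim", expression="closeness(field, embedding)"),
    ],
    first_phase="normalized_bm25",
    global_phase=GlobalPhaseRanking(
        # query(alpha) weights BM25 vs. semantic; pass it as input.query(alpha) at query time.
        expression="query(alpha) * normalize_linear(normalized_bm25) + (1 - query(alpha)) * normalize_linear(cos_sim)",
        rerank_count=1000,
    ),
    match_features=["cos_sim", "normalized_bm25"],
)
```

You could then sweep a few values of alpha with the same VespaEvaluator loop as above to pick the weighting that works best on your validation queries.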
Cleanup¶
vespa_cloud.delete()
Deactivated vespa-team.evaluation in dev.aws-us-east-1c Deleted instance vespa-team.evaluation.default