evaluation
Metrics
Abstract and concrete classes related to evaluation metrics.
EvalMetric
EvalMetric ()
Abstract class for evaluation metric.
EvalMetric.evaluate_query
EvalMetric.evaluate_query (query_results, relevant_docs, id_field, default_score, detailed_metrics=False)
Abstract method to be implemented by metrics inheriting from EvalMetric to evaluate query results.
| | Type | Default | Details |
|---|---|---|---|
| query_results | | | Raw query results returned by Vespa. |
| relevant_docs | | | Each dict contains a doc id and optionally a doc score. |
| id_field | | | The Vespa field representing the document id. |
| default_score | | | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict | | Metric values. |
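Concrete metrics subclass EvalMetric and implement evaluate_query with this signature. As an illustration only, here is a minimal sketch of a hypothetical precision-at-k metric; the class name, the simplified notion of relevance, and the hit layout query_results.hits[i]["fields"][id_field] are assumptions made for the sketch, not part of the library:
class PrecisionAtK(EvalMetric):
    "Hypothetical example metric, not part of the library."

    def __init__(self, at: int):
        self.at = at
        self.name = "precision_{}".format(at)

    def evaluate_query(
        self, query_results, relevant_docs, id_field, default_score, detailed_metrics=False
    ):
        # simplification: treat every doc id listed in relevant_docs as relevant
        relevant_ids = {str(doc["id"]) for doc in relevant_docs}
        # hit layout assumed from pyvespa's VespaQueryResponse
        retrieved_ids = [str(hit["fields"][id_field]) for hit in query_results.hits[: self.at]]
        n_relevant_retrieved = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
        return {self.name: n_relevant_retrieved / self.at}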
MatchRatio
MatchRatio ()
Computes the ratio of documents retrieved by the match phase.
Instantiate the metric:
metric = MatchRatio()
MatchRatio.evaluate_query
MatchRatio.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to match ratio metric.
| | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse | | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] | | Each dict contains a doc id and optionally a doc score. |
| id_field | str | | The Vespa field representing the document id. |
| default_score | int | | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict | | Returns the match ratio. In addition, if detailed_metrics=True, returns the number of retrieved docs (match_ratio_retrieved_docs) and the number of docs available in the corpus (match_ratio_docs_available). |
Compute match ratio:
evaluation = metric.evaluate_query(
query_results=query_results,
relevant_docs=None,
id_field="vespa_id_field",
default_score=0,
)
evaluation
{'match_ratio': 0.01731996353691887}
Return detailed metrics, in addition to match ratio:
evaluation = metric.evaluate_query(
query_results=query_results,
relevant_docs=None,
id_field="vespa_id_field",
default_score=0,
detailed_metrics=True,
)
evaluation
{'match_ratio': 0.01731996353691887,
'match_ratio_retrieved_docs': 1083,
'match_ratio_docs_available': 62529}
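The detailed output makes the definition concrete: the match ratio is the number of documents retrieved by the match phase divided by the number of documents available in the corpus. Checking against the values above:
retrieved_docs, docs_available = 1083, 62529
retrieved_docs / docs_available  # 0.01731996353691887, the match_ratio reported above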
TimeQuery
TimeQuery ()
Compute the time it takes for Vespa to execute the query.
Instantiate the metric:
time_metric = TimeQuery()
TimeQuery.evaluate_query
TimeQuery.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to query time metric.
| | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse | | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] | | Each dict contains a doc id and optionally a doc score. |
| id_field | str | | The Vespa field representing the document id. |
| default_score | int | | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict | | Returns the search time (search_time). In addition, if detailed_metrics=True, returns the query time (search_time_query_time) and the summary fetch time (search_time_summary_fetch_time). |
Compute the query time a client would observe (excluding network latency).
time_metric.evaluate_query(
query_results=query_results,
relevant_docs=None,
id_field="vespa_id_field",
default_score=0
)
{'search_time': 0.013}
Include detailed metrics. In addition to the search_time above, it returns the time to execute the first protocol phase/matching phase (search_time_query_time) and the time to execute the summary fill protocol phase for the globally ordered top-k hits (search_time_summary_fetch_time).
time_metric.evaluate_query(
query_results=query_results,
relevant_docs=None,
id_field="vespa_id_field",
default_score=0,
detailed_metrics=True
)
{'search_time': 0.013,
'search_time_query_time': 0.01,
'search_time_summary_fetch_time': 0.002}
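These values come from the timing section of Vespa's JSON result. A minimal sketch for inspecting the raw numbers directly, assuming the response exposes its payload through the json property and that timing was enabled for the query (e.g. presentation.timing=true):
# assumption: VespaQueryResponse exposes the raw response dict via .json
timing = query_results.json.get("timing", {})
# raw values in seconds, as reported by Vespa
timing.get("searchtime"), timing.get("querytime"), timing.get("summaryfetchtime")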
Recall
Recall (at:int)
Compute the recall at position at.
| | Type | Details |
|---|---|---|
| at | int | Maximum position on the resulting list to look for relevant docs. |
| Returns | None | |
Instantiate the metric:
recall_1 = Recall(at=1)
recall_2 = Recall(at=2)
recall_3 = Recall(at=3)
Recall.evaluate_query
Recall.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to recall metric.
There is an assumption that only documents with score > 0 are relevant. Recall is equal to zero if no relevant documents with score > 0 are provided.
| | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse | | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] | | Each dict contains a doc id and optionally a doc score. |
| id_field | str | | The Vespa field representing the document id. |
| default_score | int | | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict | | Returns the recall value. |
Compute recall:
evaluation = recall_2.evaluate_query(
query_results=query_results,
relevant_docs=relevant_docs,
id_field="vespa_id_field",
default_score=0,
)
evaluation
{'recall_2': 0.5}
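For intuition, recall at k is the fraction of relevant documents that appear among the top k hits. A hypothetical illustration with made-up doc ids:
relevant = {"d1", "d2"}        # doc ids with score > 0 in relevant_docs
top_2 = ["d1", "d7"]           # first two hits returned by the query
len(relevant.intersection(top_2)) / len(relevant)  # 0.5, as in recall_2 above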
ReciprocalRank
ReciprocalRank (at:int)
Compute the reciprocal rank at position at.
| | Type | Details |
|---|---|---|
| at | int | Maximum position on the resulting list to look for relevant docs. |
Instantiate the metric:
rr_1 = ReciprocalRank(at=1)
rr_2 = ReciprocalRank(at=2)
rr_3 = ReciprocalRank(at=3)
ReciprocalRank.evaluate_query
ReciprocalRank.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to reciprocal rank metric.
There is an assumption that only documents with score > 0 are relevant.
| | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse | | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] | | Each dict contains a doc id and optionally a doc score. |
| id_field | str | | The Vespa field representing the document id. |
| default_score | int | | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict | | Returns the reciprocal rank value. |
Compute reciprocal rank:
evaluation = rr_2.evaluate_query(
query_results=query_results,
relevant_docs=relevant_docs,
id_field="vespa_id_field",
default_score=0,
)
evaluation
{'reciprocal_rank_2': 0.5}
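For intuition, the reciprocal rank is 1 divided by the position of the first relevant hit among the top at results, and 0 if no relevant hit appears. A hypothetical illustration with made-up doc ids:
relevant = {"d1"}              # doc ids with score > 0 in relevant_docs
top_2 = ["d7", "d1"]           # first relevant hit appears at position 2
1 / (top_2.index("d1") + 1)    # 0.5, matching reciprocal_rank_2 above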
NormalizedDiscountedCumulativeGain
NormalizedDiscountedCumulativeGain (at:int)
Compute the normalized discounted cumulative gain at position at.
| | Type | Details |
|---|---|---|
| at | int | Maximum position on the resulting list to look for relevant docs. |
Instantiate the metric:
ndcg_1 = NormalizedDiscountedCumulativeGain(at=1)
ndcg_2 = NormalizedDiscountedCumulativeGain(at=2)
ndcg_3 = NormalizedDiscountedCumulativeGain(at=3)
NormalizedDiscountedCumulativeGain.evaluate_query
NormalizedDiscountedCumulativeGain.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to normalized discounted cumulative gain.
There is an assumption that documents returned by the query that are not included in the set of relevant documents have a score equal to zero. Similarly, if the query returns a number N < at of documents, we assume that the at - N missing scores are equal to zero.
| | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse | | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] | | Each dict contains a doc id and optionally a doc score. |
| id_field | str | | The Vespa field representing the document id. |
| default_score | int | | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict | | Returns the normalized discounted cumulative gain. In addition, if detailed_metrics=True, returns the ideal discounted cumulative gain (_ideal_dcg) and the discounted cumulative gain (_dcg). |
Compute NDCG:
evaluation = ndcg_2.evaluate_query(
query_results=query_results,
relevant_docs=relevant_docs,
id_field="vespa_id_field",
default_score=0,
)
evaluation
{'ndcg_2': 0.38685280723454163}
Return detailed metrics, in addition to NDCG:
evaluation = ndcg_2.evaluate_query(
query_results=query_results,
relevant_docs=relevant_docs,
id_field="vespa_id_field",
default_score=0,
detailed_metrics=True,
)
evaluation
{'ndcg_2': 0.38685280723454163,
'ndcg_2_ideal_dcg': 1.6309297535714575,
'ndcg_2_dcg': 0.6309297535714575}
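The detailed output can be reproduced by hand. Assuming the common formulation in which each hit contributes its relevance score divided by log2(position + 1), and noting that the ideal DCG above corresponds to two relevant documents ranked optimally:
from math import log2

dcg = 0 / log2(2) + 1 / log2(3)        # only the hit at position 2 is relevant
ideal_dcg = 1 / log2(2) + 1 / log2(3)  # two relevant docs ranked first and second
dcg / ideal_dcg                        # 0.38685280723454163, the ndcg_2 reported above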
Evaluate queries in batch
evaluate
evaluate (app:vespa.application.Vespa, labeled_data:Union[List[Dict],pandas.core.frame.DataFrame], eval_metrics:List[__main__.EvalMetric], query_model:Union[learntorank.query.QueryModel,List[learntorank.query.QueryModel]], id_field:str, default_score:int=0, detailed_metrics=False, per_query=False, aggregators=None, timeout=1000, **kwargs)
Evaluate a QueryModel according to a list of EvalMetric.
| | Type | Default | Details |
|---|---|---|---|
| app | Vespa | | Connection to a Vespa application. |
| labeled_data | typing.Union[typing.List[typing.Dict], pandas.core.frame.DataFrame] | | Data containing query, query_id and relevant docs. See examples below for format. |
| eval_metrics | typing.List[__main__.EvalMetric] | | Evaluation metrics |
| query_model | typing.Union[learntorank.query.QueryModel, typing.List[learntorank.query.QueryModel]] | | Query models to be evaluated |
| id_field | str | | The Vespa field representing the document id. |
| default_score | int | 0 | Score to assign to the additional documents that are not relevant. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| per_query | bool | False | Set to True to return evaluation metrics per query. |
| aggregators | NoneType | None | Used only if per_query=False. List of pandas friendly aggregators to summarize per model metrics. We use ["mean", "median", "std"] by default. |
| timeout | int | 1000 | Vespa query timeout in ms. |
| kwargs | | | |
| Returns | DataFrame | | Returns query_id and metrics according to the selected evaluation metrics. |
Usage:
Setup and feed a Vespa application:
from learntorank.passage import create_basic_search_package
from learntorank.passage import PassageData
from vespa.deployment import VespaDocker
app_package = create_basic_search_package(name="EvaluationApp")
vespa_docker = VespaDocker(port=8082, cfgsrv_port=19072)
app = vespa_docker.deploy(application_package=app_package)
data = PassageData.load()
responses = app.feed_df(
df=data.get_corpus(),
include_id=True,
id_field="doc_id"
)
Define query models to be evaluated:
from learntorank.query import QueryModel, OR, Ranking
bm25_query_model = QueryModel(
name="bm25",
match_phase=OR(),
ranking=Ranking(name="bm25")
)
native_query_model = QueryModel(
name="native_rank",
match_phase=OR(),
ranking=Ranking(name="native_rank")
)
Define metrics to compute during evaluation:
metrics = [
Recall(at=10),
ReciprocalRank(at=3),
NormalizedDiscountedCumulativeGain(at=3)
]
Get labeled data:
labeled_data = data.get_labels(type="dev")
labeled_data[0:2]
[{'query_id': '1101971',
'query': 'why say the sky is the limit',
'relevant_docs': [{'id': '7407715', 'score': 1}]},
{'query_id': '712898',
'query': 'what is an cvc in radiology',
'relevant_docs': [{'id': '7661336', 'score': 1}]}]
Evaluate:
evaluation = evaluate(
app=app,
labeled_data=labeled_data,
eval_metrics=metrics,
query_model=[native_query_model, bm25_query_model],
id_field="doc_id",
)
evaluation
| metric | aggregator | bm25 | native_rank |
|---|---|---|---|
| recall_10 | mean | 0.935833 | 0.845833 |
| | median | 1.000000 | 1.000000 |
| | std | 0.215444 | 0.342749 |
| reciprocal_rank_3 | mean | 0.935000 | 0.746667 |
| | median | 1.000000 | 1.000000 |
| | std | 0.231977 | 0.399551 |
| ndcg_3 | mean | 0.912839 | 0.740814 |
| | median | 1.000000 | 1.000000 |
| | std | 0.242272 | 0.387611 |
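The evaluation result is a pandas DataFrame, so standard selection applies. A small sketch for comparing the models on a single aggregated metric; the (metric, aggregator) row index is an assumption inferred from the displayed table:
# mean ndcg_3 per model; index layout assumed from the table above
evaluation.loc[("ndcg_3", "mean")]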
The evaluate function also accepts labeled data as a data frame:
labeled_df.head()
| | qid | query | doc_id | relevance |
|---|---|---|---|---|
| 0 | 1101971 | why say the sky is the limit | 7407715 | 1 |
| 1 | 712898 | what is an cvc in radiology | 7661336 | 1 |
| 2 | 154469 | dmv california how long does it take to get id | 7914544 | 1 |
| 3 | 930015 | what's an epigraph | 7928705 | 1 |
| 4 | 860085 | what is va tax | 2915383 | 1 |
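If the labels are only available in the list-of-dicts format used earlier, an equivalent frame can be assembled with pandas. A minimal sketch, assuming the column names qid, query, doc_id and relevance shown above:
import pandas as pd

# one row per (query, relevant doc) pair; column names assumed from the frame displayed above
labeled_df = pd.DataFrame(
    [
        {
            "qid": item["query_id"],
            "query": item["query"],
            "doc_id": doc["id"],
            "relevance": doc["score"],
        }
        for item in labeled_data
        for doc in item["relevant_docs"]
    ]
)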
evaluation_df = evaluate(
app=app,
labeled_data=labeled_df,
eval_metrics=metrics,
query_model=[native_query_model, bm25_query_model],
id_field="doc_id",
)
evaluation_df
| metric | aggregator | bm25 | native_rank |
|---|---|---|---|
| recall_10 | mean | 0.935833 | 0.845833 |
| | median | 1.000000 | 1.000000 |
| | std | 0.215444 | 0.342749 |
| reciprocal_rank_3 | mean | 0.935000 | 0.746667 |
| | median | 1.000000 | 1.000000 |
| | std | 0.231977 | 0.399551 |
| ndcg_3 | mean | 0.912839 | 0.740814 |
| | median | 1.000000 | 1.000000 |
| | std | 0.242272 | 0.387611 |
Control which aggregators are computed:
evaluation = evaluate(
app=app,
labeled_data=labeled_data,
eval_metrics=metrics,
query_model=[native_query_model, bm25_query_model],
id_field="doc_id",
aggregators=["mean", "std"]
)
evaluation
| metric | aggregator | bm25 | native_rank |
|---|---|---|---|
| recall_10 | mean | 0.935833 | 0.845833 |
| | std | 0.215444 | 0.342749 |
| reciprocal_rank_3 | mean | 0.935000 | 0.746667 |
| | std | 0.231977 | 0.399551 |
| ndcg_3 | mean | 0.912839 | 0.740814 |
| | std | 0.242272 | 0.387611 |
Include detailed metrics when available; this adds the intermediate computations that some of the metrics expose:
evaluation = evaluate(
app=app,
labeled_data=labeled_data,
eval_metrics=metrics,
query_model=[native_query_model, bm25_query_model],
id_field="doc_id",
aggregators=["mean", "std"],
detailed_metrics=True
)
evaluation
| metric | aggregator | bm25 | native_rank |
|---|---|---|---|
| recall_10 | mean | 0.935833 | 0.845833 |
| | std | 0.215444 | 0.342749 |
| reciprocal_rank_3 | mean | 0.935000 | 0.746667 |
| | std | 0.231977 | 0.399551 |
| ndcg_3 | mean | 0.912839 | 0.740814 |
| | std | 0.242272 | 0.387611 |
| ndcg_3_ideal_dcg | mean | 1.054165 | 1.054165 |
| | std | 0.207315 | 0.207315 |
| ndcg_3_dcg | mean | 0.938928 | 0.765474 |
| | std | 0.225533 | 0.387161 |
Generate results per query:
evaluation = evaluate(
app=app,
labeled_data=labeled_data,
eval_metrics=metrics,
query_model=[native_query_model, bm25_query_model],
id_field="doc_id",
per_query=True
)
evaluation.head()
| | model | query_id | recall_10 | reciprocal_rank_3 | ndcg_3 |
|---|---|---|---|---|---|
| 0 | native_rank | 1101971 | 1.0 | 1.0 | 1.0 |
| 1 | native_rank | 712898 | 0.0 | 0.0 | 0.0 |
| 2 | native_rank | 154469 | 1.0 | 0.0 | 0.0 |
| 3 | native_rank | 930015 | 1.0 | 0.0 | 0.0 |
| 4 | native_rank | 860085 | 0.0 | 0.0 | 0.0 |
Evaluate a specific query
evaluate_query
evaluate_query (app:vespa.application.Vespa, eval_metrics:List[__main__.EvalMetric], query_model:learntorank.query.QueryModel, query_id:str, query:str, id_field:str, relevant_docs:List[Dict], default_score:int=0, detailed_metrics=False, **kwargs)
Evaluate a single query according to a list of evaluation metrics.
| | Type | Default | Details |
|---|---|---|---|
| app | Vespa | | Connection to a Vespa application. |
| eval_metrics | typing.List[__main__.EvalMetric] | | Evaluation metrics |
| query_model | QueryModel | | Query model to be evaluated |
| query_id | str | | Query id represented as str. |
| query | str | | Query string. |
| id_field | str | | The Vespa field representing the document id. |
| relevant_docs | typing.List[typing.Dict] | | Each dict contains a doc id and optionally a doc score. |
| default_score | int | 0 | Score to assign to the additional documents that are not relevant. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| kwargs | | | |
| Returns | typing.Dict | | Contains query_id and metrics according to the selected evaluation metrics. |
Usage:
from vespa.application import Vespa
from learntorank.query import QueryModel, OR, Ranking

app = Vespa(url = "https://api.cord19.vespa.ai")
query_model = QueryModel(
match_phase = OR(),
ranking = Ranking(name="bm25", list_features=True))
Evaluate a single query:
query_evaluation = evaluate_query(
app=app,
eval_metrics = eval_metrics,
query_model = bm25_query_model,
query_id = "0",
query = "Intrauterine virus infections and congenital heart disease",
id_field = "id",
relevant_docs = [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
default_score = 0
)
query_evaluation
{'model': 'bm25',
'query_id': '0',
'match_ratio': 0.814424921006077,
'recall_10': 0.0,
'reciprocal_rank_10': 0}
Evaluate a query under specific document ids
Use the recall parameter to specify which documents should be included in the evaluation.
In the example below, we include documents with id equal to 0, 1 and 2. Since the relevant documents for this query are the documents with id 0 and 3, we should get recall equal to 0.5.
query_evaluation = evaluate_query(
app=app,
eval_metrics = eval_metrics,
query_model = query_model,
query_id = 0,
query = "Intrauterine virus infections and congenital heart disease",
id_field = "id",
relevant_docs = [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
default_score = 0,
recall = ("id", [0, 1, 2])
)
query_evaluation
{'model': 'default_name',
'query_id': 0,
'match_ratio': 9.70242657688688e-06,
'recall_10': 0.5,
'reciprocal_rank_10': 1.0}
We now include documents with id equal to 0, 1, 2 and 3. This should give a recall equal to 1.
query_evaluation = evaluate_query(
app=app,
eval_metrics = eval_metrics,
query_model = query_model,
query_id = 0,
query = "Intrauterine virus infections and congenital heart disease",
id_field = "id",
relevant_docs = [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
default_score = 0,
recall = ("id", [0, 1, 2, 3])
)
query_evaluation
{'model': 'default_name',
'query_id': 0,
'match_ratio': 1.2936568769182506e-05,
'recall_10': 1.0,
'reciprocal_rank_10': 1.0}