evaluation
Metrics
Abstract and concrete classes related to evaluation metrics.
EvalMetric
EvalMetric ()
Abstract class for evaluation metric.
EvalMetric.evaluate_query
EvalMetric.evaluate_query (query_results, relevant_docs, id_field, default_score, detailed_metrics=False)
Abstract method to be implemented by metrics inheriting from EvalMetric
to evaluate query results.
 | Type | Default | Details
---|---|---|---
query_results |  |  | Raw query results returned by Vespa.
relevant_docs |  |  | Each dict contains a doc id and optionally a doc score.
id_field |  |  | The Vespa field representing the document id.
default_score |  |  | Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics | bool | False | Return intermediate computations if available.
Returns | typing.Dict |  | Metric values.
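Concrete metrics implement evaluate_query with this signature. As an illustration, here is a minimal sketch of a custom metric. It assumes EvalMetric can be imported from learntorank.evaluation (in this notebook it lives in __main__), that metrics expose a name attribute, and that query_results follows the VespaQueryResponse API; none of this is prescribed by the abstract class itself.

from typing import Dict, List

from learntorank.evaluation import EvalMetric  # assumed import path


class HitCount(EvalMetric):
    "Hypothetical metric: the number of hits returned for the query."

    def __init__(self) -> None:
        self.name = "hit_count"

    def evaluate_query(
        self,
        query_results,
        relevant_docs: List[Dict],
        id_field: str,
        default_score: int,
        detailed_metrics=False,
    ) -> Dict:
        # VespaQueryResponse.hits holds the returned hits as a list.
        return {self.name: len(query_results.hits)}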
MatchRatio
MatchRatio ()
Computes the ratio of documents retrieved by the match phase.
Instantiate the metric:
metric = MatchRatio()
MatchRatio.evaluate_query
MatchRatio.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to the match ratio metric.
 | Type | Default | Details
---|---|---|---
query_results | VespaQueryResponse |  | Raw query results returned by Vespa.
relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score.
id_field | str |  | The Vespa field representing the document id.
default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics | bool | False | Return intermediate computations if available.
Returns | typing.Dict |  | Returns the match ratio. In addition, if detailed_metrics=True, returns the number of retrieved docs (_retrieved_docs) and the number of docs available in the corpus (_docs_available).
Compute match ratio:
evaluation = metric.evaluate_query(
    query_results=query_results,
    relevant_docs=None,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'match_ratio': 0.01731996353691887}
Return detailed metrics, in addition to match ratio:
evaluation = metric.evaluate_query(
    query_results=query_results,
    relevant_docs=None,
    id_field="vespa_id_field",
    default_score=0,
    detailed_metrics=True,
)
evaluation
{'match_ratio': 0.01731996353691887,
'match_ratio_retrieved_docs': 1083,
'match_ratio_docs_available': 62529}
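The detailed output also makes the headline number easy to verify: the match ratio is simply the number of matched documents divided by the number of documents available in the corpus.

retrieved_docs = 1083
docs_available = 62529
print(retrieved_docs / docs_available)  # 0.01731996..., the match_ratio reported above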
TimeQuery
TimeQuery ()
Compute the time it takes for Vespa to execute the query.
Instantiate the metric:
time_metric = TimeQuery()
TimeQuery.evaluate_query
TimeQuery.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to the query time metric.
 | Type | Default | Details
---|---|---|---
query_results | VespaQueryResponse |  | Raw query results returned by Vespa.
relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score.
id_field | str |  | The Vespa field representing the document id.
default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics | bool | False | Return intermediate computations if available.
Returns | typing.Dict |  | Returns the search time. In addition, if detailed_metrics=True, returns the query time (_query_time) and the summary fetch time (_summary_fetch_time).
Compute the query time a client would observe (excluding network latency).
time_metric.evaluate_query(
    query_results=query_results,
    relevant_docs=None,
    id_field="vespa_id_field",
    default_score=0
)
{'search_time': 0.013}
Include detailed metrics. In addition to the search_time above, it returns the time to execute the first protocol phase/matching phase (search_time_query_time) and the time to execute the summary fill protocol phase for the globally ordered top-k hits (search_time_summary_fetch_time).
time_metric.evaluate_query(
    query_results=query_results,
    relevant_docs=None,
    id_field="vespa_id_field",
    default_score=0,
    detailed_metrics=True
)
{'search_time': 0.013,
'search_time_query_time': 0.01,
'search_time_summary_fetch_time': 0.002}
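These values come from the timing section of the raw Vespa response. As a sketch of how to inspect them directly (it assumes the query was issued with presentation.timing enabled so that the timing block is populated):

# Field names follow Vespa's documented "timing" block; values are in seconds.
timing = query_results.json.get("timing", {})
print(timing.get("searchtime"))        # corresponds to search_time
print(timing.get("querytime"))         # corresponds to search_time_query_time
print(timing.get("summaryfetchtime"))  # corresponds to search_time_summary_fetch_time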
Recall
Recall (at:int)
Compute the recall at position at.
 | Type | Details
---|---|---
at | int | Maximum position on the resulting list to look for relevant docs.
Returns | None | 
Instantiate the metric:
recall_1 = Recall(at=1)
recall_2 = Recall(at=2)
recall_3 = Recall(at=3)
Recall.evaluate_query
Recall.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to the recall metric.
There is an assumption that only documents with score > 0 are relevant. Recall is equal to zero if no relevant documents with score > 0 are provided.
 | Type | Default | Details
---|---|---|---
query_results | VespaQueryResponse |  | Raw query results returned by Vespa.
relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score.
id_field | str |  | The Vespa field representing the document id.
default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics | bool | False | Return intermediate computations if available.
Returns | typing.Dict |  | Returns the recall value.
Compute recall:
evaluation = recall_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'recall_2': 0.5}
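For reference, recall@k is the fraction of the relevant documents that appear among the top k hits. A self-contained sketch of that computation with hypothetical ids (not the data used above):

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found among the top-k ranked ids."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# One of the two relevant documents is ranked in the top 2 -> 0.5
print(recall_at_k(ranked_ids=["d2", "d9", "d7"], relevant_ids=["d2", "d7"], k=2))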
ReciprocalRank
ReciprocalRank (at:int)
Compute the reciprocal rank at position at.
 | Type | Details
---|---|---
at | int | Maximum position on the resulting list to look for relevant docs.
Instantiate the metric:
rr_1 = ReciprocalRank(at=1)
rr_2 = ReciprocalRank(at=2)
rr_3 = ReciprocalRank(at=3)
ReciprocalRank.evaluate_query
ReciprocalRank.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to the reciprocal rank metric.
There is an assumption that only documents with score > 0 are relevant.
 | Type | Default | Details
---|---|---|---
query_results | VespaQueryResponse |  | Raw query results returned by Vespa.
relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score.
id_field | str |  | The Vespa field representing the document id.
default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics | bool | False | Return intermediate computations if available.
Returns | typing.Dict |  | Returns the reciprocal rank value.
Compute reciprocal rank:
evaluation = rr_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'reciprocal_rank_2': 0.5}
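Reciprocal rank@k is 1 divided by the position of the first relevant document within the top k, and 0 if no relevant document appears there. A self-contained sketch with hypothetical ids:

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant hit within the top k, else 0."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

# The first relevant document sits at position 2 -> 0.5
print(reciprocal_rank_at_k(["d9", "d2", "d7"], ["d2", "d7"], k=2))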
NormalizedDiscountedCumulativeGain
NormalizedDiscountedCumulativeGain (at:int)
Compute the normalized discounted cumulative gain at position at.
 | Type | Details
---|---|---
at | int | Maximum position on the resulting list to look for relevant docs.
Instantiate the metric:
ndcg_1 = NormalizedDiscountedCumulativeGain(at=1)
ndcg_2 = NormalizedDiscountedCumulativeGain(at=2)
ndcg_3 = NormalizedDiscountedCumulativeGain(at=3)
NormalizedDiscountedCumulativeGain.evaluate_query
NormalizedDiscountedCumulativeGain.evaluate_query (query_results:vespa.io.VespaQueryResponse, relevant_docs:List[Dict], id_field:str, default_score:int, detailed_metrics=False)
Evaluate query results according to normalized discounted cumulative gain.
There is an assumption that documents returned by the query that are not included in the set of relevant documents have a score equal to zero. Similarly, if the query returns a number N < at of documents, we assume that the at - N missing scores are equal to zero.
 | Type | Default | Details
---|---|---|---
query_results | VespaQueryResponse |  | Raw query results returned by Vespa.
relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score.
id_field | str |  | The Vespa field representing the document id.
default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics | bool | False | Return intermediate computations if available.
Returns | typing.Dict |  | Returns the normalized discounted cumulative gain. In addition, if detailed_metrics=True, returns the ideal discounted cumulative gain (_ideal_dcg) and the discounted cumulative gain (_dcg).
Compute NDCG:
metric = NormalizedDiscountedCumulativeGain(at=2)
evaluation = ndcg_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'ndcg_2': 0.38685280723454163}
Return detailed metrics, in addition to NDCG:
evaluation = ndcg_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
    detailed_metrics=True,
)
evaluation
{'ndcg_2': 0.38685280723454163,
'ndcg_2_ideal_dcg': 1.6309297535714575,
'ndcg_2_dcg': 0.6309297535714575}
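The detailed output lets you check the ratio ndcg = dcg / ideal_dcg. A sketch of that arithmetic, assuming the common rel_i / log2(i + 1) discounting with binary relevance, which reproduces the numbers above when one of the two relevant documents sits at position 2:

import math

def dcg(scores):
    """Discounted cumulative gain with the rel_i / log2(i + 1) convention."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(scores, start=1))

observed = dcg([0, 1])   # relevant document ranked second -> 0.6309...
ideal = dcg([1, 1])      # ideal ordering of the relevance scores -> 1.6309...
print(observed / ideal)  # 0.38685..., the ndcg_2 reported above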
Evaluate queries in batch
evaluate
evaluate (app:vespa.application.Vespa, labeled_data:Union[List[Dict],pandas.core.frame.DataFrame], eval_metrics:List[__main__.EvalMetric], query_model:Union[learntorank.query.QueryModel,List[learntorank.query.QueryModel]], id_field:str, default_score:int=0, detailed_metrics=False, per_query=False, aggregators=None, timeout=1000, **kwargs)
Evaluate a QueryModel according to a list of EvalMetric.
 | Type | Default | Details
---|---|---|---
app | Vespa |  | Connection to a Vespa application.
labeled_data | typing.Union[typing.List[typing.Dict], pandas.core.frame.DataFrame] |  | Data containing query, query_id and relevant docs. See examples below for the format.
eval_metrics | typing.List[__main__.EvalMetric] |  | Evaluation metrics.
query_model | typing.Union[learntorank.query.QueryModel, typing.List[learntorank.query.QueryModel]] |  | Query models to be evaluated.
id_field | str |  | The Vespa field representing the document id.
default_score | int | 0 | Score to assign to the additional documents that are not relevant.
detailed_metrics | bool | False | Return intermediate computations if available.
per_query | bool | False | Set to True to return evaluation metrics per query.
aggregators | NoneType | None | Used only if per_query=False. List of pandas-friendly aggregators to summarize per-model metrics. We use ["mean", "median", "std"] by default.
timeout | int | 1000 | Vespa query timeout in ms.
kwargs |  |  | 
Returns | DataFrame |  | Returns query_id and metrics according to the selected evaluation metrics.
Usage:
Setup and feed a Vespa application:
from learntorank.passage import create_basic_search_package
from learntorank.passage import PassageData
from vespa.deployment import VespaDocker
app_package = create_basic_search_package(name="EvaluationApp")
vespa_docker = VespaDocker(port=8082, cfgsrv_port=19072)
app = vespa_docker.deploy(application_package=app_package)
data = PassageData.load()
responses = app.feed_df(
    df=data.get_corpus(),
    include_id=True,
    id_field="doc_id"
)
Define query models to be evaluated:
from learntorank.query import QueryModel, OR, Ranking

bm25_query_model = QueryModel(
    name="bm25",
    match_phase=OR(),
    ranking=Ranking(name="bm25")
)
native_query_model = QueryModel(
    name="native_rank",
    match_phase=OR(),
    ranking=Ranking(name="native_rank")
)
Define metrics to compute during evaluation:
metrics = [
    Recall(at=10),
    ReciprocalRank(at=3),
    NormalizedDiscountedCumulativeGain(at=3)
]
Get labeled data:
labeled_data = data.get_labels(type="dev")
labeled_data[0:2]
[{'query_id': '1101971',
'query': 'why say the sky is the limit',
'relevant_docs': [{'id': '7407715', 'score': 1}]},
{'query_id': '712898',
'query': 'what is an cvc in radiology',
'relevant_docs': [{'id': '7661336', 'score': 1}]}]
Evaluate:
evaluation = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=metrics,
    query_model=[native_query_model, bm25_query_model],
    id_field="doc_id",
)
evaluation
model |  | bm25 | native_rank
---|---|---|---
recall_10 | mean | 0.935833 | 0.845833
 | median | 1.000000 | 1.000000
 | std | 0.215444 | 0.342749
reciprocal_rank_3 | mean | 0.935000 | 0.746667
 | median | 1.000000 | 1.000000
 | std | 0.231977 | 0.399551
ndcg_3 | mean | 0.912839 | 0.740814
 | median | 1.000000 | 1.000000
 | std | 0.242272 | 0.387611
The evaluate function also accepts labeled data as a data frame:
labeled_df.head()
 | qid | query | doc_id | relevance
---|---|---|---|---
0 | 1101971 | why say the sky is the limit | 7407715 | 1
1 | 712898 | what is an cvc in radiology | 7661336 | 1
2 | 154469 | dmv california how long does it take to get id | 7914544 | 1
3 | 930015 | what's an epigraph | 7928705 | 1
4 | 860085 | what is va tax | 2915383 | 1
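If your labels are in the list-of-dicts format shown earlier, a small pandas transformation produces this layout. The sketch below assumes one row per (query, relevant document) pair and the column names shown in the table above:

import pandas as pd

labeled_df = pd.DataFrame(
    [
        {
            "qid": sample["query_id"],
            "query": sample["query"],
            "doc_id": doc["id"],
            "relevance": doc["score"],
        }
        for sample in labeled_data
        for doc in sample["relevant_docs"]
    ]
)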
evaluation_df = evaluate(
    app=app,
    labeled_data=labeled_df,
    eval_metrics=metrics,
    query_model=[native_query_model, bm25_query_model],
    id_field="doc_id",
)
evaluation_df
model |  | bm25 | native_rank
---|---|---|---
recall_10 | mean | 0.935833 | 0.845833
 | median | 1.000000 | 1.000000
 | std | 0.215444 | 0.342749
reciprocal_rank_3 | mean | 0.935000 | 0.746667
 | median | 1.000000 | 1.000000
 | std | 0.231977 | 0.399551
ndcg_3 | mean | 0.912839 | 0.740814
 | median | 1.000000 | 1.000000
 | std | 0.242272 | 0.387611
Control which aggregators are computed:
evaluation = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=metrics,
    query_model=[native_query_model, bm25_query_model],
    id_field="doc_id",
    aggregators=["mean", "std"]
)
evaluation
model |  | bm25 | native_rank
---|---|---|---
recall_10 | mean | 0.935833 | 0.845833
 | std | 0.215444 | 0.342749
reciprocal_rank_3 | mean | 0.935000 | 0.746667
 | std | 0.231977 | 0.399551
ndcg_3 | mean | 0.912839 | 0.740814
 | std | 0.242272 | 0.387611
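Since the aggregators are applied with pandas, other pandas-friendly reducer names can be requested in the same way. The choice below is purely illustrative (a sketch, not output from the application above):

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=metrics,
    query_model=[native_query_model, bm25_query_model],
    id_field="doc_id",
    aggregators=["mean", "max", "min"]  # any pandas-friendly aggregator name
)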
Include detailed metrics when available; these are intermediate computations reported by some of the metrics:
evaluation = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=metrics,
    query_model=[native_query_model, bm25_query_model],
    id_field="doc_id",
    aggregators=["mean", "std"],
    detailed_metrics=True
)
evaluation
model |  | bm25 | native_rank
---|---|---|---
recall_10 | mean | 0.935833 | 0.845833
 | std | 0.215444 | 0.342749
reciprocal_rank_3 | mean | 0.935000 | 0.746667
 | std | 0.231977 | 0.399551
ndcg_3 | mean | 0.912839 | 0.740814
 | std | 0.242272 | 0.387611
ndcg_3_ideal_dcg | mean | 1.054165 | 1.054165
 | std | 0.207315 | 0.207315
ndcg_3_dcg | mean | 0.938928 | 0.765474
 | std | 0.225533 | 0.387161
Generate results per query:
evaluation = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=metrics,
    query_model=[native_query_model, bm25_query_model],
    id_field="doc_id",
    per_query=True
)
evaluation.head()
 | model | query_id | recall_10 | reciprocal_rank_3 | ndcg_3
---|---|---|---|---|---
0 | native_rank | 1101971 | 1.0 | 1.0 | 1.0
1 | native_rank | 712898 | 0.0 | 0.0 | 0.0
2 | native_rank | 154469 | 1.0 | 0.0 | 0.0
3 | native_rank | 930015 | 1.0 | 0.0 | 0.0
4 | native_rank | 860085 | 0.0 | 0.0 | 0.0
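The per-query result is a regular pandas DataFrame, so custom summaries can be derived from it directly. A minimal sketch using the columns shown above:

# Aggregate the per-query metrics by model; this reproduces the kind of summary
# returned when per_query=False, plus anything else pandas supports.
summary = evaluation.groupby("model")[["recall_10", "reciprocal_rank_3", "ndcg_3"]].agg(["mean", "std"])
summary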
Evaluate a specific query
evaluate_query
evaluate_query (app:vespa.application.Vespa, eval_metrics:List[__main__.EvalMetric], query_model:learntorank.query.QueryModel, query_id:str, query:str, id_field:str, relevant_docs:List[Dict], default_score:int=0, detailed_metrics=False, **kwargs)
Evaluate a single query according to a list of evaluation metrics.
 | Type | Default | Details
---|---|---|---
app | Vespa |  | Connection to a Vespa application.
eval_metrics | typing.List[__main__.EvalMetric] |  | Evaluation metrics.
query_model | QueryModel |  | Query model to be evaluated.
query_id | str |  | Query id represented as str.
query | str |  | Query string.
id_field | str |  | The Vespa field representing the document id.
relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score.
default_score | int | 0 | Score to assign to the additional documents that are not relevant.
detailed_metrics | bool | False | Return intermediate computations if available.
kwargs |  |  | 
Returns | typing.Dict |  | Contains query_id and metrics according to the selected evaluation metrics.
Usage:
from vespa.application import Vespa

app = Vespa(url="https://api.cord19.vespa.ai")
query_model = QueryModel(
    match_phase=OR(),
    ranking=Ranking(name="bm25", list_features=True)
)
eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]  # metrics used in the examples below
Evaluate a single query:
query_evaluation = evaluate_query(
    app=app,
    eval_metrics=eval_metrics,
    query_model=bm25_query_model,
    query_id="0",
    query="Intrauterine virus infections and congenital heart disease",
    id_field="id",
    relevant_docs=[{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    default_score=0
)
query_evaluation
{'model': 'bm25',
'query_id': '0',
'match_ratio': 0.814424921006077,
'recall_10': 0.0,
'reciprocal_rank_10': 0}
Evaluate a query under specific document ids
Use the recall parameter to specify which documents should be included in the evaluation.
In the example below, we include the documents with id equal to 0, 1 and 2. Since the relevant documents for this query are the documents with id 0 and 3, we should get a recall equal to 0.5.
query_evaluation = evaluate_query(
    app=app,
    eval_metrics=eval_metrics,
    query_model=query_model,
    query_id=0,
    query="Intrauterine virus infections and congenital heart disease",
    id_field="id",
    relevant_docs=[{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    default_score=0,
    recall=("id", [0, 1, 2])
)
query_evaluation
{'model': 'default_name',
'query_id': 0,
'match_ratio': 9.70242657688688e-06,
'recall_10': 0.5,
'reciprocal_rank_10': 1.0}
We now include documents with id equal to 0, 1, 2 and 3. This should give a recall equal to 1.
query_evaluation = evaluate_query(
    app=app,
    eval_metrics=eval_metrics,
    query_model=query_model,
    query_id=0,
    query="Intrauterine virus infections and congenital heart disease",
    id_field="id",
    relevant_docs=[{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    default_score=0,
    recall=("id", [0, 1, 2, 3])
)
query_evaluation
{'model': 'default_name',
'query_id': 0,
'match_ratio': 1.2936568769182506e-05,
'recall_10': 1.0,
'reciprocal_rank_10': 1.0}