evaluation

API reference for the evaluation functions and metrics.

Metrics

Abstract and concrete classes related to evaluation metrics.


source

EvalMetric

 EvalMetric ()

Abstract class for evaluation metric.


source

EvalMetric.evaluate_query

 EvalMetric.evaluate_query (query_results, relevant_docs, id_field,
                            default_score, detailed_metrics=False)

Abstract method to be implemented by metrics inheriting from EvalMetric to evaluate query results.

Type Default Details
query_results Raw query results returned by Vespa.
relevant_docs Each dict contains a doc id and optionally a doc score.
id_field The Vespa field representing the document id.
default_score Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics bool False Return intermediate computations if available.
Returns typing.Dict Metric values.
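A custom metric can be defined by subclassing EvalMetric and implementing evaluate_query with the signature above. The sketch below is illustrative only: the class name, the precision logic, and the way hits are read from the response are assumptions, not part of the library.

from learntorank.evaluation import EvalMetric

# Hypothetical custom metric: precision among the top `at` hits.
# Names and logic are illustrative, not part of the library.
class PrecisionAt(EvalMetric):
    def __init__(self, at: int):
        self.at = at
        self.name = "precision_{}".format(at)

    def evaluate_query(
        self, query_results, relevant_docs, id_field, default_score,
        detailed_metrics=False,
    ):
        # Treat doc ids with a positive score as relevant.
        relevant_ids = {
            str(doc["id"]) for doc in (relevant_docs or []) if doc.get("score", 1) > 0
        }
        top_hits = query_results.hits[: self.at]
        retrieved_ids = [str(hit["fields"][id_field]) for hit in top_hits]
        hits_found = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
        return {self.name: hits_found / self.at if self.at else 0.0}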

source

MatchRatio

 MatchRatio ()

Computes the ratio of documents retrieved by the match phase.

Instantiate the metric:

metric = MatchRatio()

source

MatchRatio.evaluate_query

 MatchRatio.evaluate_query (query_results:vespa.io.VespaQueryResponse,
                            relevant_docs:List[Dict], id_field:str,
                            default_score:int, detailed_metrics=False)

Evaluate query results according to match ratio metric.

Type Default Details
query_results VespaQueryResponse Raw query results returned by Vespa.
relevant_docs typing.List[typing.Dict] Each dict contains a doc id and optionally a doc score.
id_field str The Vespa field representing the document id.
default_score int Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics bool False Return intermediate computations if available.
Returns typing.Dict Returns the match ratio. In addition, if detailed_metrics=True, returns the number of retrieved docs _retrieved_docs and the number of docs available in the corpus _docs_available.
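The query_results object used in the examples below is a VespaQueryResponse returned by an earlier query; a hypothetical way to obtain one (the connected app, the YQL, and the query string are illustrative, not part of this section):

query_results = app.query(body={
    "yql": "select * from sources * where userQuery()",
    "query": "what keeps planes in the air",
})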

Compute match ratio:

evaluation = metric.evaluate_query(
    query_results=query_results, 
    relevant_docs=None,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'match_ratio': 0.01731996353691887}

Return detailed metrics, in addition to match ratio:

evaluation = metric.evaluate_query(
    query_results=query_results,
    relevant_docs=None,
    id_field="vespa_id_field",
    default_score=0,
    detailed_metrics=True,
)
evaluation
{'match_ratio': 0.01731996353691887,
 'match_ratio_retrieved_docs': 1083,
 'match_ratio_docs_available': 62529}
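The detailed values reproduce the ratio itself: the match ratio is the number of documents retrieved by the match phase divided by the number of documents available in the corpus.

1083 / 62529  # 0.01731996353691887, the match_ratio reported above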

source

TimeQuery

 TimeQuery ()

Compute the time it takes for Vespa to execute the query.

Instantiate the metric:

time_metric = TimeQuery()

source

TimeQuery.evaluate_query

 TimeQuery.evaluate_query (query_results:vespa.io.VespaQueryResponse,
                           relevant_docs:List[Dict], id_field:str,
                           default_score:int, detailed_metrics=False)

Evaluate query results according to query time metric.

Type Default Details
query_results VespaQueryResponse Raw query results returned by Vespa.
relevant_docs typing.List[typing.Dict] Each dict contains a doc id and optionally a doc score.
id_field str The Vespa field representing the document id.
default_score int Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics bool False Return intermediate computations if available.
Returns typing.Dict Returns the query search time search_time. In addition, if detailed_metrics=True, returns the matching-phase time search_time_query_time and the summary fetch time search_time_summary_fetch_time.

Compute the query time a client would observe (excluding network latency).

time_metric.evaluate_query(
    query_results=query_results, 
    relevant_docs=None, 
    id_field="vespa_id_field",
    default_score=0
)
{'search_time': 0.013}

Include detailed metrics. In addition to search_time above, it returns the time to execute the first protocol phase, i.e. the matching phase (search_time_query_time), and the time to execute the summary fill protocol phase for the globally ordered top-k hits (search_time_summary_fetch_time).

time_metric.evaluate_query(
    query_results=query_results, 
    relevant_docs=None, 
    id_field="vespa_id_field",
    default_score=0,
    detailed_metrics=True
)
{'search_time': 0.013,
 'search_time_query_time': 0.01,
 'search_time_summary_fetch_time': 0.002}
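These values come from the timing information Vespa attaches to the query response. Assuming the timing section is present in the response JSON (a sketch; availability can depend on the query settings), it can be inspected directly:

# Timing values are reported in seconds.
query_results.json.get("timing")
# e.g. {'querytime': 0.01, 'summaryfetchtime': 0.002, 'searchtime': 0.013}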

source

Recall

 Recall (at:int)

Compute the recall at position at.

Type Details
at int Maximum position on the resulting list to look for relevant docs.
Returns None

Instantiate the metric:

recall_1 = Recall(at=1)
recall_2 = Recall(at=2)
recall_3 = Recall(at=3)

source

Recall.evaluate_query

 Recall.evaluate_query (query_results:vespa.io.VespaQueryResponse,
                        relevant_docs:List[Dict], id_field:str,
                        default_score:int, detailed_metrics=False)

Evaluate query results according to recall metric.

There is an assumption that only documents with score > 0 are relevant. Recall is equal to zero if no relevant document with score > 0 is provided.

Type Default Details
query_results VespaQueryResponse Raw query results returned by Vespa.
relevant_docs typing.List[typing.Dict] Each dict contains a doc id and optionally a doc score.
id_field str The Vespa field representing the document id.
default_score int Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics bool False Return intermediate computations if available.
Returns typing.Dict Returns the recall value.
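The relevant_docs argument used below follows the labeled-data format; a hypothetical example with two relevant documents (the ids are illustrative):

relevant_docs = [
    {"id": "relevant-doc-1", "score": 1},
    {"id": "relevant-doc-2", "score": 1},
]

If, as in this sketch, there are two relevant documents, a recall_2 of 0.5 would mean that one of them appears among the top 2 hits.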

Compute recall:

evaluation = recall_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'recall_2': 0.5}


source

ReciprocalRank

 ReciprocalRank (at:int)

Compute the reciprocal rank at position at.

Type Details
at int Maximum position on the resulting list to look for relevant docs.

Instantiate the metric:

rr_1 = ReciprocalRank(at=1)
rr_2 = ReciprocalRank(at=2)
rr_3 = ReciprocalRank(at=3)

source

ReciprocalRank.evaluate_query

 ReciprocalRank.evaluate_query (query_results:vespa.io.VespaQueryResponse,
                                relevant_docs:List[Dict], id_field:str,
                                default_score:int, detailed_metrics=False)

Evaluate query results according to reciprocal rank metric.

There is an assumption that only documents with score > 0 are relevant.

Type Default Details
query_results VespaQueryResponse Raw query results returned by Vespa.
relevant_docs typing.List[typing.Dict] Each dict contains a doc id and optionally a doc score.
id_field str The Vespa field representing the document id.
default_score int Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics bool False Return intermediate computations if available.
Returns typing.Dict Returns the reciprocal rank value.

Compute reciprocal rank:

evaluation = rr_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'reciprocal_rank_2': 0.5}
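A reciprocal rank of 0.5 indicates that the first relevant document sits at position 2 of the result list; a quick check of the arithmetic (positions are 1-based):

first_relevant_position = 2  # implied by the reciprocal_rank_2 value above
1 / first_relevant_position  # 0.5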

source

NormalizedDiscountedCumulativeGain

 NormalizedDiscountedCumulativeGain (at:int)

Compute the normalized discounted cumulative gain at position at.

Type Details
at int Maximum position on the resulting list to look for relevant docs.

Instantiate the metric:

ndcg_1 = NormalizedDiscountedCumulativeGain(at=1)
ndcg_2 = NormalizedDiscountedCumulativeGain(at=2)
ndcg_3 = NormalizedDiscountedCumulativeGain(at=3)

source

NormalizedDiscountedCumulativeGain.evaluate_query

 NormalizedDiscountedCumulativeGain.evaluate_query (query_results:vespa.io.VespaQueryResponse,
                                                    relevant_docs:List[Dict],
                                                    id_field:str, default_score:int,
                                                    detailed_metrics=False)

Evaluate query results according to normalized discounted cumulative gain.

There is an assumption that documents returned by the query that are not included in the set of relevant documents have score equal to zero. Similarly, if the query returns a number N < at of documents, we assume that the at - N missing documents have score equal to zero.

Type Default Details
query_results VespaQueryResponse Raw query results returned by Vespa.
relevant_docs typing.List[typing.Dict] Each dict contains a doc id and optionally a doc score.
id_field str The Vespa field representing the document id.
default_score int Score to assign to the additional documents that are not relevant. Defaults to 0.
detailed_metrics bool False Return intermediate computations if available.
Returns typing.Dict Returns the normalized discounted cumulative gain. In addition, if detailed_metrics=True, returns the ideal discounted cumulative gain _ideal_dcg and the discounted cumulative gain _dcg.

Compute NDCG:

evaluation = ndcg_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
)
evaluation
{'ndcg_2': 0.38685280723454163}

Return detailed metrics, in addition to NDCG:

evaluation = ndcg_2.evaluate_query(
    query_results=query_results,
    relevant_docs=relevant_docs,
    id_field="vespa_id_field",
    default_score=0,
    detailed_metrics=True,
)
evaluation
{'ndcg_2': 0.38685280723454163,
 'ndcg_2_ideal_dcg': 1.6309297535714575,
 'ndcg_2_dcg': 0.6309297535714575}
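The detailed values are consistent with the standard DCG formulation, where a document with relevance rel at position i contributes rel / log2(i + 1): with (at least) two relevant documents of score 1, the ideal DCG at position 2 is 1 + 1/log2(3), and a single relevant document found at position 2 gives a DCG of 1/log2(3). A quick check of the arithmetic (a sketch, not the library's internal code):

from math import log2

ideal_dcg = 1 / log2(2) + 1 / log2(3)  # ≈ 1.6309297535714575, the ideal_dcg above
dcg = 1 / log2(3)                      # ≈ 0.6309297535714575, the dcg above
dcg / ideal_dcg                        # ≈ 0.38685280723454163, the ndcg_2 above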

Evaluate queries in batch


source

evaluate

 evaluate (app:vespa.application.Vespa,
           labeled_data:Union[List[Dict],pandas.core.frame.DataFrame],
           eval_metrics:List[__main__.EvalMetric],
           query_model:Union[learntorank.query.QueryModel,List[learntorank.query.QueryModel]],
           id_field:str, default_score:int=0, detailed_metrics=False,
           per_query=False, aggregators=None, timeout=1000, **kwargs)

Evaluate a QueryModel according to a list of EvalMetric.

Type Default Details
app Vespa Connection to a Vespa application.
labeled_data typing.Union[typing.List[typing.Dict], pandas.core.frame.DataFrame] Data containing query, query_id and relevant docs. See examples below for format.
eval_metrics typing.List[main.EvalMetric] Evaluation metrics
query_model typing.Union[learntorank.query.QueryModel, typing.List[learntorank.query.QueryModel]] Query models to be evaluated
id_field str The Vespa field representing the document id.
default_score int 0 Score to assign to the additional documents that are not relevant.
detailed_metrics bool False Return intermediate computations if available.
per_query bool False Set to True to return evaluation metrics per query.
aggregators NoneType None Used only if per_query=False. List of pandas-friendly aggregators to summarize per-model metrics. We use ["mean", "median", "std"] by default.
timeout int 1000 Vespa query timeout in ms.
kwargs
Returns DataFrame Returns the selected evaluation metrics, aggregated per query model by default, or per query_id when per_query=True.

Usage:

Set up and feed a Vespa application:

from learntorank.passage import create_basic_search_package
from learntorank.passage import PassageData
from vespa.deployment import VespaDocker
app_package = create_basic_search_package(name="EvaluationApp")
vespa_docker = VespaDocker(port=8082, cfgsrv_port=19072)
app = vespa_docker.deploy(application_package=app_package)
data = PassageData.load()
responses = app.feed_df(
    df=data.get_corpus(), 
    include_id=True, 
    id_field="doc_id"
)

Define query models to be evaluated:

from learntorank.query import QueryModel, OR, Ranking
bm25_query_model = QueryModel(
    name="bm25", 
    match_phase=OR(), 
    ranking=Ranking(name="bm25")
)
native_query_model = QueryModel(
    name="native_rank", 
    match_phase=OR(), 
    ranking=Ranking(name="native_rank")
)

Define metrics to compute during evaluation:

metrics = [
    Recall(at=10), 
    ReciprocalRank(at=3), 
    NormalizedDiscountedCumulativeGain(at=3)
]

Get labeled data:

labeled_data = data.get_labels(type="dev")
labeled_data[0:2]
[{'query_id': '1101971',
  'query': 'why say the sky is the limit',
  'relevant_docs': [{'id': '7407715', 'score': 1}]},
 {'query_id': '712898',
  'query': 'what is an cvc in radiology',
  'relevant_docs': [{'id': '7661336', 'score': 1}]}]

Evaluate:

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    eval_metrics=metrics, 
    query_model=[native_query_model, bm25_query_model], 
    id_field="doc_id",
)
evaluation
model bm25 native_rank
recall_10 mean 0.935833 0.845833
median 1.000000 1.000000
std 0.215444 0.342749
reciprocal_rank_3 mean 0.935000 0.746667
median 1.000000 1.000000
std 0.231977 0.399551
ndcg_3 mean 0.912839 0.740814
median 1.000000 1.000000
std 0.242272 0.387611

The evaluate function also accepts labeled data as a data frame:
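The labeled_df used below is not built in this section; a minimal sketch, assuming the labeled_data list shown above, of how an equivalent data frame could be constructed (the actual notebook may obtain it differently):

import pandas as pd

# One row per (query, relevant document) pair.
labeled_df = pd.DataFrame(
    [
        {
            "qid": label["query_id"],
            "query": label["query"],
            "doc_id": doc["id"],
            "relevance": doc["score"],
        }
        for label in labeled_data
        for doc in label["relevant_docs"]
    ]
)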

labeled_df.head()
qid query doc_id relevance
0 1101971 why say the sky is the limit 7407715 1
1 712898 what is an cvc in radiology 7661336 1
2 154469 dmv california how long does it take to get id 7914544 1
3 930015 what's an epigraph 7928705 1
4 860085 what is va tax 2915383 1
evaluation_df = evaluate(
    app=app,
    labeled_data=labeled_df, 
    eval_metrics=metrics, 
    query_model=[native_query_model, bm25_query_model], 
    id_field="doc_id",
)
evaluation_df
model bm25 native_rank
recall_10 mean 0.935833 0.845833
median 1.000000 1.000000
std 0.215444 0.342749
reciprocal_rank_3 mean 0.935000 0.746667
median 1.000000 1.000000
std 0.231977 0.399551
ndcg_3 mean 0.912839 0.740814
median 1.000000 1.000000
std 0.242272 0.387611

Control which aggregators are computed:

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    eval_metrics=metrics, 
    query_model=[native_query_model, bm25_query_model], 
    id_field="doc_id",
    aggregators=["mean", "std"]
)
evaluation
model bm25 native_rank
recall_10 mean 0.935833 0.845833
std 0.215444 0.342749
reciprocal_rank_3 mean 0.935000 0.746667
std 0.231977 0.399551
ndcg_3 mean 0.912839 0.740814
std 0.242272 0.387611

Include detailed metrics when available; this adds the intermediate computations that some of the metrics expose:

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    eval_metrics=metrics, 
    query_model=[native_query_model, bm25_query_model], 
    id_field="doc_id",
    aggregators=["mean", "std"],
    detailed_metrics=True
)
evaluation
model bm25 native_rank
recall_10 mean 0.935833 0.845833
std 0.215444 0.342749
reciprocal_rank_3 mean 0.935000 0.746667
std 0.231977 0.399551
ndcg_3 mean 0.912839 0.740814
std 0.242272 0.387611
ndcg_3_ideal_dcg mean 1.054165 1.054165
std 0.207315 0.207315
ndcg_3_dcg mean 0.938928 0.765474
std 0.225533 0.387161

Generate results per query:

evaluation = evaluate(
    app=app,
    labeled_data=labeled_data, 
    eval_metrics=metrics, 
    query_model=[native_query_model, bm25_query_model], 
    id_field="doc_id",
    per_query=True
)
evaluation.head()
model query_id recall_10 reciprocal_rank_3 ndcg_3
0 native_rank 1101971 1.0 1.0 1.0
1 native_rank 712898 0.0 0.0 0.0
2 native_rank 154469 1.0 0.0 0.0
3 native_rank 930015 1.0 0.0 0.0
4 native_rank 860085 0.0 0.0 0.0
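With per_query=True the aggregation is left to you; a sketch of a custom summary using pandas (the column names follow the output above):

# Average each metric per query model.
evaluation.groupby("model")[["recall_10", "reciprocal_rank_3", "ndcg_3"]].mean()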

Evaluate a specific query


source

evaluate_query

 evaluate_query (app:vespa.application.Vespa,
                 eval_metrics:List[__main__.EvalMetric],
                 query_model:learntorank.query.QueryModel, query_id:str,
                 query:str, id_field:str, relevant_docs:List[Dict],
                 default_score:int=0, detailed_metrics=False, **kwargs)

Evaluate a single query according to evaluation metrics.

Type Default Details
app Vespa Connection to a Vespa application.
eval_metrics typing.List[main.EvalMetric] Evaluation metrics
query_model QueryModel Query model to be evaluated
query_id str Query id represented as str.
query str Query string.
id_field str The Vespa field representing the document id.
relevant_docs typing.List[typing.Dict] Each dict contains a doc id and optionally a doc score.
default_score int 0 Score to assign to the additional documents that are not relevant.
detailed_metrics bool False Return intermediate computations if available.
kwargs
Returns typing.Dict Contains query_id and metrics according to the selected evaluation metrics.

Usage:

from vespa.application import Vespa

app = Vespa(url = "https://api.cord19.vespa.ai")
query_model = QueryModel(
    match_phase = OR(),
    ranking = Ranking(name="bm25", list_features=True))
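The eval_metrics list and the bm25_query_model used in the next call are assumed from earlier in this section; given the metric keys in the output below (match_ratio, recall_10, reciprocal_rank_10), they presumably look like:

# Assumed definitions; the metric choices are inferred from the output keys below.
eval_metrics = [MatchRatio(), Recall(at=10), ReciprocalRank(at=10)]

bm25_query_model = QueryModel(
    name="bm25",
    match_phase=OR(),
    ranking=Ranking(name="bm25")
)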

Evaluate a single query:

query_evaluation = evaluate_query(
    app=app,
    eval_metrics = eval_metrics, 
    query_model = bm25_query_model, 
    query_id = "0", 
    query = "Intrauterine virus infections and congenital heart disease", 
    id_field = "id",
    relevant_docs = [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    default_score = 0
)
query_evaluation
{'model': 'bm25',
 'query_id': '0',
 'match_ratio': 0.814424921006077,
 'recall_10': 0.0,
 'reciprocal_rank_10': 0}

Evaluate a query under specific document ids

Use the recall query parameter to specify which documents should be included in the evaluation.

In the example below, we include documents with id equal to 0, 1 and 2. Since the relevant documents for this query are the documents with id 0 and 3, we should get recall equal to 0.5.

query_evaluation = evaluate_query(
    app=app,
    eval_metrics = eval_metrics, 
    query_model = query_model, 
    query_id = 0, 
    query = "Intrauterine virus infections and congenital heart disease", 
    id_field = "id",
    relevant_docs = [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    default_score = 0,
    recall = ("id", [0, 1, 2])
)
query_evaluation
{'model': 'default_name',
 'query_id': 0,
 'match_ratio': 9.70242657688688e-06,
 'recall_10': 0.5,
 'reciprocal_rank_10': 1.0}

We now include documents with id equal to 0, 1, 2 and 3. This should give a recall equal to 1.

query_evaluation = evaluate_query(
    app=app,
    eval_metrics = eval_metrics, 
    query_model = query_model, 
    query_id = 0, 
    query = "Intrauterine virus infections and congenital heart disease", 
    id_field = "id",
    relevant_docs = [{"id": 0, "score": 1}, {"id": 3, "score": 1}],
    default_score = 0,
    recall = ("id", [0, 1, 2, 3])
)
query_evaluation
{'model': 'default_name',
 'query_id': 0,
 'match_ratio': 1.2936568769182506e-05,
 'recall_10': 1.0,
 'reciprocal_rank_10': 1.0}