How to evaluate Vespa ranking functions from python
Download processed data
We can start by downloading the data that we have processed before.
```python
import requests, json
from pandas import read_csv

topics = json.loads(
    requests.get("https://thigm85.github.io/data/cord19/topics.json").text
)
relevance_data = read_csv("https://thigm85.github.io/data/cord19/relevance_data.csv")
```
`topics` contains data about the 50 topics available, including `query`, `question` and `narrative`.
"1"] topics[
{'query': 'coronavirus origin',
'question': 'what is the origin of COVID-19',
'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
`relevance_data` contains the relevance judgments for each of the 50 topics.
```python
relevance_data.head(5)
```
|   | topic_id | round_id | cord_uid | relevancy |
|---|----------|----------|----------|-----------|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |
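Before turning this into labeled data, it can be useful to inspect how the judgments are distributed. A small pandas sketch, assuming only the columns shown above:
```python
# Total number of judgments and the distribution of relevance scores.
# Rows with relevancy <= 0 are dropped when building labeled_data below.
print(relevance_data.shape)
print(relevance_data["relevancy"].value_counts())
```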
Format the labeled data into expected pyvespa format
`pyvespa` expects labeled data to follow the format illustrated below. It is a list of dicts, where each dict represents a query containing `query_id`, `query` and a list of `relevant_docs`. Each relevant document contains a required `id` key and an optional `score` key.
```python
labeled_data = [
    {
        'query_id': 1,
        'query': 'coronavirus origin',
        'relevant_docs': [{'id': '005b2j4b', 'score': 2}, {'id': '00fmeepz', 'score': 1}]
    },
    {
        'query_id': 2,
        'query': 'coronavirus response to weather changes',
        'relevant_docs': [{'id': '01goni72', 'score': 2}, {'id': '03h85lvy', 'score': 2}]
    }
]
```
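If you build labeled data by hand, a small helper (hypothetical, not part of pyvespa) can catch shape mistakes early:
```python
def check_labeled_data(data):
    # Hypothetical validation helper: each query needs query_id, query and relevant_docs,
    # and each relevant document needs at least an id ('score' is optional).
    for entry in data:
        assert {"query_id", "query", "relevant_docs"} <= set(entry.keys())
        for doc in entry["relevant_docs"]:
            assert "id" in doc

check_labeled_data(labeled_data)
```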
We can create `labeled_data` from the `topics` and `relevance_data` that we downloaded before. We will only include documents with a relevance score > 0 in the final list.
```python
labeled_data = [
    {
        "query_id": int(topic_id),
        "query": topics[topic_id]["query"],
        "relevant_docs": [
            {
                "id": row["cord_uid"],
                "score": row["relevancy"]
            }
            for idx, row in relevance_data[relevance_data.topic_id == int(topic_id)].iterrows()
            if row["relevancy"] > 0
        ]
    }
    for topic_id in topics.keys()
]
```
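A quick look at the result (the exact counts depend on the relevance file downloaded above):
```python
# One entry per topic, containing only the positively judged documents.
print(len(labeled_data))
first = labeled_data[0]
print(first["query_id"], first["query"])
print(first["relevant_docs"][:2])
```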
Define query models to be evaluated
We are going to define two query models to be evaluated here. Both will match all the documents that share at least one term with the query. This is defined by setting `match_phase = OR()`.
The difference between the query models lies in the ranking phase. The `or_default` model ranks documents based on nativeRank, while the `or_bm25` model ranks documents based on BM25. A discussion of those two types of ranking is out of scope for this tutorial; it is enough to know that they rank documents according to two different formulas.
Those ranking profiles were defined by the team behind the cord19 app and can be found here.
```python
from learntorank.query import QueryModel, Ranking, OR

query_models = [
    QueryModel(
        name="or_default",
        match_phase=OR(),
        ranking=Ranking(name="default")
    ),
    QueryModel(
        name="or_bm25",
        match_phase=OR(),
        ranking=Ranking(name="bm25t5")
    )
]
```
Define metrics to be used in the evaluation
We would like to compute the following metrics:
- The percentage of documents matched by the query
- Recall @ 10
- Reciprocal rank @ 10
- NDCG @ 10
```python
from learntorank.evaluation import MatchRatio, Recall, ReciprocalRank, NormalizedDiscountedCumulativeGain

eval_metrics = [
    MatchRatio(),
    Recall(at=10),
    ReciprocalRank(at=10),
    NormalizedDiscountedCumulativeGain(at=10)
]
```
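For intuition only, here is a plain-Python sketch of what recall at 10 and reciprocal rank at 10 compute for a single query, given a ranked list of document ids and the set of relevant ids. This is not the learntorank implementation, just an illustration of the formulas:
```python
def recall_at_10(ranked_ids, relevant_ids):
    # Fraction of the relevant documents that appear among the top 10 results.
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:10]) & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank_at_10(ranked_ids, relevant_ids):
    # 1 / position of the first relevant document within the top 10, or 0 if none is found.
    for position, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0
```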
Evaluate
Connect to a running Vespa instance:
```python
from vespa.application import Vespa

app = Vespa(url="https://api.cord19.vespa.ai")
```
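Before running the full evaluation, a minimal raw query can confirm that the endpoint responds. The request body below is an assumption for illustration (pyvespa's `query` method accepting a raw body, plus the YQL and `type` parameters) and is not part of the evaluation itself:
```python
# Send one OR-style query and check that we get hits back from the public cord19 instance.
response = app.query(body={
    "yql": "select * from sources * where userQuery();",
    "query": "coronavirus origin",
    "type": "any",
    "hits": 1
})
print(len(response.hits))
```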
Compute the metrics defined above for each query model.
```python
from learntorank.evaluation import evaluate

evaluations = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=query_models,
    id_field="cord_uid",
    hits=10
)
evaluations
```
| metric | statistic | or_bm25 | or_default |
|--------|-----------|---------|------------|
| match_ratio | mean | 0.411789 | 0.411789 |
| | median | 0.282227 | 0.282227 |
| | std | 0.238502 | 0.238502 |
| recall_10 | mean | 0.007720 | 0.005457 |
| | median | 0.006089 | 0.003753 |
| | std | 0.006386 | 0.005458 |
| reciprocal_rank_10 | mean | 0.594357 | 0.561579 |
| | median | 0.500000 | 0.500000 |
| | std | 0.397597 | 0.401255 |
| ndcg_10 | mean | 0.353095 | 0.274515 |
| | median | 0.355978 | 0.253619 |
| | std | 0.216460 | 0.203170 |
We can also return the raw evaluation metrics per query:
```python
evaluations = evaluate(
    app=app,
    labeled_data=labeled_data,
    eval_metrics=eval_metrics,
    query_model=query_models,
    id_field="cord_uid",
    hits=10,
    per_query=True
)
evaluations.head()
```
|   | model | query_id | match_ratio | recall_10 | reciprocal_rank_10 | ndcg_10 |
|---|-------|----------|-------------|-----------|--------------------|---------|
| 0 | or_default | 1 | 0.230847 | 0.008584 | 1.000000 | 0.519431 |
| 1 | or_default | 2 | 0.755230 | 0.000000 | 0.000000 | 0.000000 |
| 2 | or_default | 3 | 0.264601 | 0.001534 | 0.142857 | 0.036682 |
| 3 | or_default | 4 | 0.843341 | 0.001764 | 0.333333 | 0.110046 |
| 4 | or_default | 5 | 0.901317 | 0.003096 | 0.250000 | 0.258330 |
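Per-query metrics make it easy to run paired comparisons between the two models. For example, a pandas sketch (assuming the columns shown above) that pivots `ndcg_10` by model and checks how often `or_bm25` beats `or_default`:
```python
# One row per query_id, one NDCG@10 column per model.
ndcg = evaluations.pivot(index="query_id", columns="model", values="ndcg_10")
ndcg["bm25_minus_default"] = ndcg["or_bm25"] - ndcg["or_default"]
print((ndcg["bm25_minus_default"] > 0).mean())  # fraction of queries where BM25 wins
print(ndcg["bm25_minus_default"].describe())
```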