How to download and parse TREC-COVID data
Download the data
The files used in this section were originally found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgements data. Do not worry about what they are just yet, we will explore them soon.
!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/topics-rnd5.xml
!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/qrels-covid_d5_j0.5-5.txt
Parse the data
Topics
The topics file is in XML format. We can parse it and store the result in a dictionary called topics. We want to extract a query, a question and a narrative from each topic.
import xml.etree.ElementTree as ET

topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text
There are a total of 50 topics. For example, we can see the first topic below:
topics["1"]
{'query': 'coronavirus origin',
'question': 'what is the origin of COVID-19',
'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
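The parsing loop above can also be written more compactly. The following is only a sketch under a few assumptions: SAMPLE_XML is a made-up two-topic document that mimics the structure of topics-rnd5.xml, and parse_topics and FIELDS are hypothetical names introduced here. Unlike the loop above, it stores None for a field that is missing instead of leaving the key out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the layout of topics-rnd5.xml.
# The second topic is invented and deliberately lacks a <narrative>.
SAMPLE_XML = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the virus's origin</narrative>
  </topic>
  <topic number="2">
    <query>example query</query>
    <question>example question</question>
  </topic>
</topics>
"""

FIELDS = ("query", "question", "narrative")


def parse_topics(root):
    """Return {topic number: {field: text}} for every <topic> under root."""
    parsed = {}
    for topic in root.findall("topic"):
        number = topic.attrib["number"]
        # findtext returns None when the child element is absent.
        parsed[number] = {field: topic.findtext(field) for field in FIELDS}
    return parsed


sample_topics = parse_topics(ET.fromstring(SAMPLE_XML.strip()))
print(sample_topics["1"]["query"])
print(sample_topics["2"]["narrative"])
```

To run it against the downloaded file, pass ET.parse("topics-rnd5.xml").getroot() to parse_topics instead of the sample.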
Each topic has many relevance judgements associated with it.
Relevance judgements
We can load the relevance judgement data directly into a pandas DataFrame.
import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]
The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy of 0 means irrelevant, 1 means relevant and 2 means highly relevant.
relevance_data.head()
| | topic_id | round_id | cord_uid | relevancy |
|---|---|---|---|---|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |
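To make the grade coding easier to read, the numeric relevancy values can be mapped to labels. This is a small sketch, assuming a space-separated layout like the qrels file; SAMPLE_QRELS is a hypothetical inline sample (its last row is invented to show grade 0) and GRADE_LABELS is a name introduced here for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in the same space-separated layout as the qrels file.
# The first three rows echo the table above; the last row is made up.
SAMPLE_QRELS = """1 4.5 005b2j4b 2
1 4.0 00fmeepz 1
1 0.5 010vptx3 2
1 1.0 xxxxxxxx 0
"""

sample = pd.read_csv(
    io.StringIO(SAMPLE_QRELS),
    sep=" ",
    header=None,
    names=["topic_id", "round_id", "cord_uid", "relevancy"],
)

# Grade coding as described in the text: 0, 1 and 2.
GRADE_LABELS = {0: "irrelevant", 1: "relevant", 2: "highly relevant"}

# Values without an entry in GRADE_LABELS would become NaN.
sample["label"] = sample["relevancy"].map(GRADE_LABELS)
print(sample)
```

Passing names= to read_csv sets the column names at load time, which is equivalent to assigning relevance_data.columns afterwards.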
We are going to remove two rows that have relevancy equal to -1, which I am assuming is an error.
relevance_data[relevance_data.relevancy == -1]
| | topic_id | round_id | cord_uid | relevancy |
|---|---|---|---|---|
| 55873 | 38 | 5.0 | 9hbib8b3 | -1 |
| 69173 | 50 | 5.0 | ucipq8uk | -1 |
relevance_data = relevance_data[relevance_data.relevancy >= 0]
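A slightly stricter variant of this filter keeps only the three documented grades and reports what was dropped. The sketch below assumes a small hypothetical frame rather than the full file; VALID_GRADES, dropped and cleaned are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical frame; the two -1 rows echo the ones shown above.
sample = pd.DataFrame(
    {
        "topic_id": [1, 1, 38, 50],
        "round_id": [4.5, 4.0, 5.0, 5.0],
        "cord_uid": ["005b2j4b", "00fmeepz", "9hbib8b3", "ucipq8uk"],
        "relevancy": [2, 1, -1, -1],
    }
)

# Assumption: only grades 0, 1 and 2 are meaningful.
VALID_GRADES = [0, 1, 2]
is_valid = sample["relevancy"].isin(VALID_GRADES)

dropped = sample[~is_valid]                       # rows with unexpected grades
cleaned = sample[is_valid].reset_index(drop=True)  # rows kept for evaluation

print(f"dropped {len(dropped)} rows, kept {len(cleaned)}")
```

Inspecting dropped before discarding it makes it easy to confirm that only the suspicious rows were removed.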
Next we will discuss how we can use this data to evaluate and improve the cord19 search app.