How to download and parse TREC-COVID data
Download the data
The files used in this section were originally found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgements. Do not worry about what they contain just yet; we will explore them soon.
!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/topics-rnd5.xml
!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/qrels-covid_d5_j0.5-5.txt
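The `!curl` lines in this section rely on notebook shell syntax. Outside a notebook, a rough Python equivalent could look like the sketch below. The `download` helper and the `DATA_URLS` name are hypothetical and only for illustration; the URLs are the ones from the `curl` commands.

```python
import posixpath
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# URLs copied from the curl commands in this section.
DATA_URLS = [
    "https://data.vespa.oath.cloud/blog/cord19/topics-rnd5.xml",
    "https://data.vespa.oath.cloud/blog/cord19/qrels-covid_d5_j0.5-5.txt",
]


def download(url, dest_dir="."):
    """Save `url` under its remote file name inside `dest_dir` (hypothetical helper)."""
    file_name = posixpath.basename(urlparse(url).path)
    target = Path(dest_dir) / file_name
    # urlopen follows redirects and raises on HTTP errors, roughly like `curl -fL`.
    with urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Usage (needs network access):
# for url in DATA_URLS:
#     download(url)
```

Keeping the remote file name mirrors the `-O` flag, so the parsing code in the rest of this section can open the same file names.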
Parse the data
Topics
The topics file is in XML format. We can parse it and store the result in a dictionary called topics. From each topic we want to extract a query, a question and a narrative.
import xml.etree.ElementTree as ET

topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text

There are a total of 50 topics. For example, we can see the first topic below:
topics["1"]

{'query': 'coronavirus origin',
 'question': 'what is the origin of COVID-19',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
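To make the expected file structure explicit, here is a minimal sketch that parses an in-memory sample instead of the downloaded file. The sample XML is an assumption: it only mirrors what the parsing code relies on (`topic` elements with a `number` attribute and `query`, `question` and `narrative` children), and the root element name `topics` is a guess. The `parse_topics` helper is hypothetical. It uses `findtext`, which returns the text of the first matching child, so it agrees with the loop-based version only when each field appears once per topic.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the field values are copied from the first topic shown above.
SAMPLE_XML = """<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
</topics>"""

FIELDS = ("query", "question", "narrative")


def parse_topics(root):
    """Map each topic number to a dict holding its query, question and narrative."""
    return {
        topic.attrib["number"]: {field: topic.findtext(field) for field in FIELDS}
        for topic in root.findall("topic")
    }


sample_topics = parse_topics(ET.fromstring(SAMPLE_XML))
print(sample_topics["1"]["query"])
```

The same helper could be applied to the real file by passing `ET.parse("topics-rnd5.xml").getroot()`, assuming the file follows the structure described above.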
Each topic has many relevance judgements associated with it.
Relevance judgements
We can load the relevance judgement data directly into a pandas DataFrame.
import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]

The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy of 0 means irrelevant, 1 means relevant and 2 means highly relevant.
relevance_data.head()

| | topic_id | round_id | cord_uid | relevancy |
|---|---|---|---|---|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |
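The meaning of the relevancy labels can also be spelled out in code. The sketch below uses only the standard library and the five rows displayed above, written out as whitespace-separated lines; that line layout is an assumption inferred from the `sep=" "` argument. The names `RELEVANCY_LABELS` and `read_judgements` are hypothetical.

```python
from collections import Counter

# The five rows displayed above, written as whitespace-separated lines (assumed layout).
SAMPLE_QRELS = """1 4.5 005b2j4b 2
1 4.0 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4.0 021q9884 1
"""

# Label meanings as described in the text.
RELEVANCY_LABELS = {0: "irrelevant", 1: "relevant", 2: "highly relevant"}


def read_judgements(text):
    """Turn qrels lines into dicts with typed fields, skipping blank lines."""
    judgements = []
    for line in text.splitlines():
        if not line.strip():
            continue
        topic_id, round_id, cord_uid, relevancy = line.split()
        judgements.append(
            {
                "topic_id": int(topic_id),
                "round_id": float(round_id),
                "cord_uid": cord_uid,
                "relevancy": int(relevancy),
            }
        )
    return judgements


judgements = read_judgements(SAMPLE_QRELS)
# Any value outside the documented labels is counted as "unknown".
label_counts = Counter(
    RELEVANCY_LABELS.get(j["relevancy"], "unknown") for j in judgements
)
print(label_counts)
```

Counting unexpected values under a separate "unknown" key is one way to notice labels that fall outside the documented 0 to 2 range.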
We are going to remove the two rows that have relevancy equal to -1, which I am assuming are errors.
relevance_data[relevance_data.relevancy == -1]

| | topic_id | round_id | cord_uid | relevancy |
|---|---|---|---|---|
| 55873 | 38 | 5.0 | 9hbib8b3 | -1 |
| 69173 | 50 | 5.0 | ucipq8uk | -1 |
relevance_data = relevance_data[relevance_data.relevancy >= 0]

Next we will discuss how we can use this data to evaluate and improve the cord19 search app.