How to download and parse TREC-COVID data

Your first step toward helping improve the cord19 search application.

Download the data

The files used in this section were originally found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgement data. Do not worry about what they are just yet; we will explore them soon.

!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/topics-rnd5.xml
!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/qrels-covid_d5_j0.5-5.txt

Parse the data

Topics

The topics file is in XML format. We can parse it and store the result in a dictionary called topics. From each topic we want to extract a query, a question and a narrative.

import xml.etree.ElementTree as ET

# Map each topic number (a string such as "1") to its query, question and narrative.
topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    # Each loop copies the text of the matching child element, if it is present.
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text

There are a total of 50 topics. For example, we can see the first topic below:

topics["1"]

{'query': 'coronavirus origin',
 'question': 'what is the origin of COVID-19',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
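If you prefer a more compact form, the same extraction can be sketched as a small function built on `findtext`. This is only an illustration: `parse_topics` and `SAMPLE_XML` are names made up for this example, and the one-topic sample is a reduced stand-in for the downloaded file, assumed to have the same `topic`/`query`/`question`/`narrative` layout that the loop above relies on.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-topic sample mirroring the layout the parsing loop expects.
SAMPLE_XML = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
</topics>
"""


def parse_topics(root):
    """Map each topic number to its query, question and narrative text."""
    fields = ("query", "question", "narrative")
    return {
        topic.attrib["number"]: {field: topic.findtext(field) for field in fields}
        for topic in root.findall("topic")
    }


sample_topics = parse_topics(ET.fromstring(SAMPLE_XML.strip()))
print(sample_topics["1"]["query"])
```

One difference from the loop version: `findtext` returns `None` when a child element is missing, so every topic always has all three keys.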

Each topic has many relevance judgements associated with it.

Relevance judgements

We can load the relevance judgement data directly into a pandas DataFrame.

import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]

The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy of 0 means irrelevant, 1 means relevant and 2 means highly relevant.

relevance_data.head()

	topic_id	round_id	cord_uid	relevancy
0	1	4.5	005b2j4b	2
1	1	4.0	00fmeepz	1
2	1	0.5	010vptx3	2
3	1	2.5	0194oljo	1
4	1	4.0	021q9884	1
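A quick way to see how the labels are distributed is `value_counts`. The sketch below reuses the five rows printed above as a stand-in for the full file, so that it runs on its own; `SAMPLE_QRELS`, `sample` and `label_counts` are names chosen for this example. On the real data you would call the same method on `relevance_data["relevancy"]`.

```python
import io

import pandas as pd

# The five rows shown by relevance_data.head(), in the file's space-separated layout.
SAMPLE_QRELS = """1 4.5 005b2j4b 2
1 4.0 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4.0 021q9884 1
"""

sample = pd.read_csv(
    io.StringIO(SAMPLE_QRELS),
    sep=" ",
    header=None,
    names=["topic_id", "round_id", "cord_uid", "relevancy"],
)

# Count how many judgements carry each relevancy label, ordered by label.
label_counts = sample["relevancy"].value_counts().sort_index()
print(label_counts)
```

Passing `names=` to `read_csv` sets the column names in the same call, which is equivalent to assigning `columns` afterwards.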

We are going to remove the two rows that have relevancy equal to -1, which I am assuming are errors.

relevance_data[relevance_data.relevancy == -1]

	topic_id	round_id	cord_uid	relevancy
55873	38	5.0	9hbib8b3	-1
69173	50	5.0	ucipq8uk	-1

relevance_data = relevance_data[relevance_data.relevancy >= 0]
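A slightly stricter alternative, assuming the only valid labels are the 0, 1 and 2 described earlier, is to keep rows through an explicit allow-list and report how many rows were dropped. This is a sketch on a handful of rows copied from the outputs above, not the full file; `VALID_LABELS`, `sample`, `cleaned` and `dropped` are names introduced for this example.

```python
import pandas as pd

# Two ordinary rows plus the two rows with relevancy -1 shown in this section.
rows = [
    (1, 4.5, "005b2j4b", 2),
    (1, 4.0, "00fmeepz", 1),
    (38, 5.0, "9hbib8b3", -1),
    (50, 5.0, "ucipq8uk", -1),
]
sample = pd.DataFrame(rows, columns=["topic_id", "round_id", "cord_uid", "relevancy"])

# Assumption: 0, 1 and 2 are the only meaningful relevancy labels.
VALID_LABELS = [0, 1, 2]

cleaned = sample[sample["relevancy"].isin(VALID_LABELS)]
dropped = len(sample) - len(cleaned)
print(f"dropped {dropped} rows with unexpected labels")
```

Unlike `relevancy >= 0`, the allow-list would also reject an unexpected positive value such as 3, and the dropped count makes it easy to confirm that only the two known rows were removed.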

Next, we will discuss how we can use this data to evaluate and improve the cord19 search app.