How to download and parse TREC-COVID data

Your first step towards contributing to the improvement of the cord19 search application.

Download the data

The files used in this section were originally found at We will download both the topics and the relevance judgements data. Do not worry about what they are just yet; we will explore them soon.

!curl -fsSLO
!curl -fsSLO

Parse the data


The topics file is in XML format. We can parse it and store the result in a dictionary called topics. We want to extract a query, a question and a narrative from each topic.

import xml.etree.ElementTree as ET

# Map each topic number to its query, question and narrative.
topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    # Copy the text of each child element into the topic's dictionary.
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text

There are 50 topics in total. For example, the first topic, topics["1"], looks like this:

{'query': 'coronavirus origin',
 'question': 'what is the origin of COVID-19',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
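The parsing logic can be tried without downloading anything. The sketch below parses a small inline XML string shaped like the topics file. The sample string, the root tag name `topics` and the helper name `parse_topics` are assumptions made for illustration; only the first topic shown above is reproduced.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample mirroring the structure assumed for the topics file.
SAMPLE_XML = """
<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
</topics>
"""


def parse_topics(root):
    """Return {topic number: {"query", "question", "narrative"}} for an XML root."""
    fields = ("query", "question", "narrative")
    return {
        # findtext returns the text of the first matching child, or None if absent.
        topic.attrib["number"]: {field: topic.findtext(field) for field in fields}
        for topic in root.findall("topic")
    }


sample_topics = parse_topics(ET.fromstring(SAMPLE_XML))
print(sample_topics["1"]["query"])
```

Unlike the loop above, this variant stores None for a missing element instead of leaving the key out, which may or may not be what you want.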

Each topic has many relevance judgements associated with it.

Relevance judgements

We can load the relevance judgement data directly into a pandas DataFrame.

import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]

The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy of 0 means irrelevant, 1 means relevant and 2 means highly relevant.

   topic_id  round_id  cord_uid  relevancy
0         1       4.5  005b2j4b          2
1         1       4.0  00fmeepz          1
2         1       0.5  010vptx3          2
3         1       2.5  0194oljo          1
4         1       4.0  021q9884          1
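To see how many judgements fall into each relevancy grade, something like the following could be used. It is only a sketch: it runs on the five rows previewed above, held in memory, rather than on the full file, and it passes the column names through `names=` instead of assigning `columns` afterwards. The variable names are illustrative.

```python
import io

import pandas as pd

# Five rows copied from the preview above, in the space-separated qrels layout.
SAMPLE_QRELS = """1 4.5 005b2j4b 2
1 4.0 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4.0 021q9884 1
"""

sample = pd.read_csv(
    io.StringIO(SAMPLE_QRELS),
    sep=" ",
    header=None,
    names=["topic_id", "round_id", "cord_uid", "relevancy"],
)

# Number of judgements per relevancy grade, ordered by grade.
grade_counts = sample["relevancy"].value_counts().sort_index()
print(grade_counts)
```

Running the same two lines on the full relevance_data would give the grade distribution across all topics and rounds.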

We are going to remove two rows that have relevancy equal to -1, which I am assuming is an error.

relevance_data[relevance_data.relevancy == -1]

       topic_id  round_id  cord_uid  relevancy
55873        38       5.0  9hbib8b3         -1
69173        50       5.0  ucipq8uk         -1

relevance_data = relevance_data[relevance_data.relevancy >= 0]
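The filter can be checked on a small hand-built sample. The sketch below reuses a few rows from the outputs above, including the two with relevancy -1, and then counts the remaining judgements per topic. The names `sample`, `cleaned` and `judgements_per_topic` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rows taken from the outputs shown above, including the two suspicious ones.
sample = pd.DataFrame(
    {
        "topic_id": [1, 1, 38, 50],
        "round_id": [4.5, 4.0, 5.0, 5.0],
        "cord_uid": ["005b2j4b", "00fmeepz", "9hbib8b3", "ucipq8uk"],
        "relevancy": [2, 1, -1, -1],
    }
)

# Keep only rows with a non-negative relevancy grade.
cleaned = sample[sample.relevancy >= 0]

# Count the remaining judgements for each topic.
judgements_per_topic = cleaned.groupby("topic_id").size()
print(judgements_per_topic)
```

Applied to the full relevance_data, the same groupby shows how many judgements each of the topics has.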

Next, we will discuss how we can use this data to evaluate and improve the cord19 search app.