How to download and parse TREC-COVID data

This is the first step toward contributing to the improvement of the cord19 search application.

Download the data

The files used in this section were originally found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics file and the relevance judgements file. Do not worry about what they contain just yet; we will explore them soon.

!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/topics-rnd5.xml
!curl -fsSLO https://data.vespa.oath.cloud/blog/cord19/qrels-covid_d5_j0.5-5.txt

Parse the data

Topics

The topics file is in XML format. We can parse it and store the result in a dictionary called topics, keyed by topic number. From each topic we want to extract a query, a question and a narrative.

import xml.etree.ElementTree as ET

# Map each topic number to its query, question and narrative.
topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]  # kept as a string, e.g. "1"
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text

There are a total of 50 topics. For example, we can see the first topic below:

topics["1"]
{'query': 'coronavirus origin',
 'question': 'what is the origin of COVID-19',
 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
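
To experiment with the parsing logic without downloading the file, here is a minimal sketch that runs on an inline XML snippet. The snippet's layout (a root element containing topic elements with a number attribute and query, question and narrative children) is assumed from the loop above, and the root tag name topics, the helper parse_topics and the constant SAMPLE_XML are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Inline sample with a single topic, using the field values shown for topic "1".
# The <topics> root tag is an assumption; only the <topic> children matter here.
SAMPLE_XML = """<topics>
  <topic number="1">
    <query>coronavirus origin</query>
    <question>what is the origin of COVID-19</question>
    <narrative>seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans</narrative>
  </topic>
</topics>"""


def parse_topics(xml_text):
    """Return {topic_number: {"query": ..., "question": ..., "narrative": ...}}."""
    root = ET.fromstring(xml_text)
    fields = ("query", "question", "narrative")
    return {
        # findtext returns None for a missing child, so fall back to "".
        topic.attrib["number"]: {f: (topic.findtext(f) or "").strip() for f in fields}
        for topic in root.findall("topic")
    }


sample_topics = parse_topics(SAMPLE_XML)
print(sample_topics["1"]["query"])
```

Compared with the nested loops, findtext reads the first matching child directly, and a topic with a missing field gets an empty string instead of a missing key.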

Each topic has many relevance judgements associated with it.

Relevance judgements

We can load the relevance judgement data directly into a pandas DataFrame.

import pandas as pd

relevance_data = pd.read_csv("qrels-covid_d5_j0.5-5.txt", sep=" ", header=None)
relevance_data.columns = ["topic_id", "round_id", "cord_uid", "relevancy"]

The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy of 0 means irrelevant, 1 means relevant and 2 means highly relevant.

relevance_data.head()
   topic_id  round_id  cord_uid  relevancy
0         1       4.5  005b2j4b          2
1         1       4.0  00fmeepz          1
2         1       0.5  010vptx3          2
3         1       2.5  0194oljo          1
4         1       4.0  021q9884          1
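
To make the relevancy codes easier to read, we can map them to labels and count how often each one occurs. The sketch below uses only the five rows shown above, loaded from an inline string, so the counts describe this sample and not the full file. The names SAMPLE_QRELS, RELEVANCY_LABELS and the label column are hypothetical, introduced for illustration.

```python
import io

import pandas as pd

# The five rows from relevance_data.head(), written as space-separated text.
SAMPLE_QRELS = """1 4.5 005b2j4b 2
1 4.0 00fmeepz 1
1 0.5 010vptx3 2
1 2.5 0194oljo 1
1 4.0 021q9884 1
"""

sample = pd.read_csv(
    io.StringIO(SAMPLE_QRELS),
    sep=" ",
    header=None,
    names=["topic_id", "round_id", "cord_uid", "relevancy"],
)

# Label mapping taken from the description of the relevancy column.
RELEVANCY_LABELS = {0: "irrelevant", 1: "relevant", 2: "highly relevant"}
sample["label"] = sample["relevancy"].map(RELEVANCY_LABELS)

# Number of judgements per label in the sample.
label_counts = sample["label"].value_counts()
print(label_counts)
```

Any relevancy value missing from the mapping becomes NaN in the label column, which is a quick way to spot unexpected codes.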

We are going to remove the two rows that have relevancy equal to -1, which I assume are errors.

relevance_data[relevance_data.relevancy == -1]
       topic_id  round_id  cord_uid  relevancy
55873        38       5.0  9hbib8b3         -1
69173        50       5.0  ucipq8uk         -1

relevance_data = relevance_data[relevance_data.relevancy >= 0]
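
As a sanity check on this kind of filter, here is a small sketch that keeps only the documented relevancy values (0, 1 and 2) and reports what was dropped. It runs on an inline sample made of two rows from the head output and the two -1 rows, so it does not need the downloaded file. Using isin with an explicit list is a slightly stricter alternative to the >= 0 comparison, and VALID_RELEVANCY, dropped and cleaned are hypothetical names.

```python
import io

import pandas as pd

# Two valid rows plus the two rows with relevancy -1 shown above.
SAMPLE_QRELS = """1 4.5 005b2j4b 2
1 4.0 00fmeepz 1
38 5.0 9hbib8b3 -1
50 5.0 ucipq8uk -1
"""

sample = pd.read_csv(
    io.StringIO(SAMPLE_QRELS),
    sep=" ",
    header=None,
    names=["topic_id", "round_id", "cord_uid", "relevancy"],
)

# Keep only the documented relevancy codes; anything else is set aside.
VALID_RELEVANCY = [0, 1, 2]
is_valid = sample["relevancy"].isin(VALID_RELEVANCY)
dropped = sample[~is_valid]
cleaned = sample[is_valid].reset_index(drop=True)

print(f"dropped {len(dropped)} rows, kept {len(cleaned)}")
```

Keeping the dropped rows in their own DataFrame makes it easy to inspect them later instead of discarding them silently.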

Next we will discuss how we can use this data to evaluate and improve the cord19 search app.