Overview of the COVID-19 Open Research Dataset (CORD-19) + Kaggle Challenge


It seems impossible at this point for anyone not to have heard about or been impacted by the global coronavirus pandemic.

Even Jared Leto is up to speed.

jaredleto.png


Everyone is trying to make sense of things, especially the medical community on the front lines of it all.

So much new research is coming out about COVID-19 that it is hard - even impossible - to sift through all of it in a meaningful way.


The White House and several research groups - listed at the end of this post - have released the COVID-19 Open Research Dataset (CORD-19), a massive dataset of scholarly papers related to COVID-19 and other coronaviruses.

The COVID-19 Open Research Dataset Challenge has been launched on Kaggle as well.

I'll give a short overview of the Kaggle challenge and mention some other hackathons that are happening for COVID-19-related projects.

CORD-19

CORD-19 is a corpus of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2 (the virus that causes it), and related coronaviruses.

The dataset will be updated weekly with new research as it is published.


Call to action

The goal is for researchers to apply Natural Language Processing (NLP) techniques to develop tools for text and data mining.

Tools are needed that help the medical community use all of this information to answer important scientific questions, apply those answers in the fight against the virus, and learn more about the pandemic in general.


The data

The dataset contains all COVID-19 and coronavirus-related research (e.g. SARS, MERS, etc.) from the following sources:

  • PubMed's PMC open access corpus
  • Additional COVID-19 research articles from a corpus maintained by the WHO
  • bioRxiv and medRxiv pre-prints, selected using the same query as PMC - click the above PubMed PMC link to see the query

This list of sources comes from Semantic Scholar's website for the dataset, where you can also download the data.

About the dataset

  • Each paper is a single JSON object.
  • You can view the schema here.
  • There is also a comprehensive metadata file available, and the dataset maintainers recommend using metadata from that file instead of the parsed metadata inside each paper's JSON.
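
Since each paper is a single JSON object, getting at the text mostly comes down to loading the file and pulling out the fields you care about. Here is a minimal sketch in Python. The field names (`paper_id`, `metadata.title`, `abstract`, `body_text`) and the sample paper are assumptions made for illustration, so check them against the actual schema before relying on them.

```python
import json

# A tiny stand-in for one CORD-19 paper file. The field names are
# assumptions for illustration; verify them against the real schema.
SAMPLE_PAPER = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "A hypothetical coronavirus study"},
  "abstract": [{"text": "We summarize prior work."}],
  "body_text": [
    {"section": "Introduction", "text": "Coronaviruses are a family of viruses."},
    {"section": "Methods", "text": "We reviewed the literature."}
  ]
}
"""


def parse_paper(raw_json):
    """Return the id, title, abstract, and body text of one paper."""
    paper = json.loads(raw_json)
    return {
        "paper_id": paper.get("paper_id", ""),
        "title": paper.get("metadata", {}).get("title", ""),
        # Abstract and body are assumed to be lists of paragraph objects.
        "abstract": " ".join(p.get("text", "") for p in paper.get("abstract", [])),
        "body": "\n".join(p.get("text", "") for p in paper.get("body_text", [])),
    }


record = parse_paper(SAMPLE_PAPER)
print(record["paper_id"], "-", record["title"])
```

For real files you would read each one with `open(path)` and pass its contents to `parse_paper`. The `.get()` calls with defaults are there because not every paper is guaranteed to have every field.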

Kaggle challenge

As mentioned, Kaggle is hosting a competition for this, and they have a bunch of questions that anyone can get started with.

According to the Kaggle website, Kaggle is sponsoring a $1,000 award per task, which goes to the submission identified as best meeting the evaluation criteria.

Tasks

Right now Kaggle lists the following tasks for this challenge - you can find them here.


  • What is known about transmission, incubation, and environmental stability?
  • What do we know about COVID-19 risk factors?
  • What do we know about virus genetics, origin, and evolution?
  • Help us understand how geography affects virality.
  • What do we know about non-pharmaceutical interventions?
  • What do we know about vaccines and therapeutics?
  • What has been published about information sharing and inter-sectoral collaboration?
  • What has been published about ethical and social science considerations?
  • What do we know about diagnostics and surveillance?
  • What has been published about medical care?

This is the initial list, but there will probably be more added, so keep checking back.
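
To get a feel for how one of the task questions above could be turned into a text-mining problem, here is a rough sketch that ranks a few documents against a question using a simple TF-IDF score. This is only one possible starting point, not the approach Kaggle expects, and the three example documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(question, documents):
    """Return document indices, best match first, by summed TF-IDF score."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    question_terms = set(tokenize(question))
    scores = []
    for idx, counts in enumerate(doc_counts):
        total = sum(counts.values()) or 1
        score = 0.0
        for term in question_terms:
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in doc_counts if term in c)
            if df == 0:
                continue
            # Smoothed inverse document frequency.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += (counts[term] / total) * idf
        scores.append((score, idx))
    # Sort by descending score, breaking ties by original order.
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Made-up example texts, not taken from the dataset.
docs = [
    "Vaccine candidates and therapeutics under clinical evaluation.",
    "Incubation period and transmission of the virus in households.",
    "Ethical considerations in public health research.",
]
question = "What is known about transmission, incubation, and environmental stability?"
ranking = rank_documents(question, docs)
print(ranking)
```

With real data you would swap `docs` for abstracts or body text from the dataset. Common words like "and" still contribute to the score here, so filtering stop words would be a sensible next step.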


Tools and Resources

There are a lot of tools for working with CORD-19 data, as well as other data feeds related to COVID-19.

This is far from an exhaustive list; it only covers some that I've seen mentioned on Reddit or on the website for the CORD-19 dataset.

Tools and models

Data feeds and APIs

Hackathons

Along with the Kaggle competition, other hackathons related to the COVID-19 pandemic have been popping up.

Be well, and thanks for reading!

Many of us have some extra time on our hands since we're supposed to be staying indoors.

Even if you're not familiar with data science or natural language processing, you could dip your toes in with this and play around with the data.

If you have any questions or comments, write them below or reach out on Twitter @LVNGD.


Research groups behind the CORD-19 dataset

  • Allen Institute for AI
  • Chan Zuckerberg Initiative
  • Georgetown University’s Center for Security and Emerging Technology
  • Microsoft Research
  • National Library of Medicine - National Institutes of Health
  • White House Office of Science and Technology Policy
