Pandemic Data Outlook

Our Person-Powered Public Data Machine

The Coronavirus Resource Center is composed of an extremely talented, diverse team that is dedicated to providing the best public data available. Our multi-step data visualization process is highly iterative, allowing for input at all levels before releasing data to our viewers. This commitment to high-quality public data should be modeled moving forward.

Authors:
Beth Blauer, Associate Vice Provost, JHU
Lauren Gardner, Associate Professor, WSE, JHU
September 14, 2021

The Coronavirus Resource Center (CRC) strives to consistently provide the best public data on COVID-19 despite less frequent reporting and dashboard closures at the state level. As we continue to advocate for public health data — and in the spirit of transparency — we would like to show you how we find, process, store, and report data. As a reminder, all data the CRC uses (apart from census, hospitalization, and variant data) is available in our public CCI and CSSE GitHub repositories, located here and here, respectively.

Eureka!

As any clichéd business story will tell you, it all starts with an idea. We receive regular input from the public, news media, and our own team with ideas for new, informative data visualizations or for expansions and updates to existing ones. One great example is our Racial Data Transparency map. When that visualization launched, we were only able to provide information on which states provided data disaggregated by race (below).

[Figure: the original Racial Data Transparency map, showing only which states reported race-disaggregated data]

Based on feedback from our team and interest from the public, we worked to expand this visualization to include the actual proportions of testing, cases, deaths, and vaccinations broken down by race, ethnicity, sex, and age. This new visualization (accessible here and shown at the end of this blog) is a major improvement that required contributions from every member of the CRC team. How did we do it?

Deep Dive for Data

Once we have an idea, our expert data team at the Johns Hopkins Applied Physics Laboratory (APL) must determine whether the necessary data exist. Team members prioritize investigating primary sources, such as governments and departments of health. Finding the most complete and up-to-date data sometimes requires less traditional sources, such as official government social media accounts (e.g., Facebook), as is the case for the U.S. Virgin Islands.

After identifying a reliable source for the raw data, APL begins to collect data manually while simultaneously developing a data scraper to automate future collection from the source (if possible). Scrapers are evaluated by running manual data collection alongside scraping for a few weeks and comparing the results to ensure they match. Once the scraper is approved, it is set up to automatically collect data from the source every 30 minutes. Because the way data is provided publicly varies widely across sources, it is not always possible to automate collection. Some data streams, such as demographic information and state policies, must be collected manually, which is done by the team at the Centers for Civic Impact (CCI). These manual processes are highly labor intensive and thus cannot be conducted as frequently as automated scraping — e.g., the demographic data is updated every two weeks.
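To make that validation step concrete, here is a minimal sketch of how a candidate scraper's output might be compared against manually collected records from the same trial period. The file names, key columns, and fields are hypothetical, not the CRC's actual pipeline.

```python
# A minimal sketch of scraper validation, assuming hypothetical file names,
# key columns, and fields: compare a candidate scraper's output against
# manually collected records for the same trial period.
import pandas as pd

MANUAL_CSV = "manual_collection.csv"   # hand-entered records (hypothetical path)
SCRAPED_CSV = "scraper_output.csv"     # candidate scraper's output (hypothetical path)

KEY = ["date", "region"]               # columns identifying one observation
VALUES = ["cases", "deaths"]           # fields that must agree exactly

def compare_runs(manual_path: str, scraped_path: str) -> pd.DataFrame:
    """Return every (date, region) row where manual and scraped values differ."""
    manual = pd.read_csv(manual_path)
    scraped = pd.read_csv(scraped_path)
    merged = manual.merge(scraped, on=KEY, suffixes=("_manual", "_scraped"))
    mask = pd.Series(False, index=merged.index)
    for col in VALUES:
        mask |= merged[f"{col}_manual"] != merged[f"{col}_scraped"]
    return merged[mask]

if __name__ == "__main__":
    mismatches = compare_runs(MANUAL_CSV, SCRAPED_CSV)
    if mismatches.empty:
        print("Scraper output matches manual collection; candidate can be promoted.")
    else:
        print(f"{len(mismatches)} mismatched rows need review:")
        print(mismatches.to_string(index=False))
```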

Sharing a Story with a Broad Audience

With the data in hand, the CCI team builds exploratory analysis tools to see if there's a story in the data worth visualizing. Once the story is approved by CRC leadership, the developers at CCI create an initial visualization in R, Python, or JavaScript. Below is an early prototype for COVID-19 death data broken down by race.

[Figure: early prototype of the COVID-19 deaths-by-race visualization]
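For a sense of what such an initial prototype involves, a quick exploratory chart of this kind can be produced in a few lines of Python with matplotlib. The groups and values below are placeholders for illustration, not real CRC data.

```python
# Illustrative only: the kind of quick exploratory chart a developer might
# prototype before the design team refines it. Groups and values are
# placeholders, not real CRC data.
import matplotlib.pyplot as plt

groups = ["Group A", "Group B", "Group C", "Group D"]   # hypothetical categories
deaths_per_100k = [45.0, 62.5, 38.2, 51.7]              # placeholder values

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(groups, deaths_per_100k)
ax.set_ylabel("Deaths per 100,000")
ax.set_title("Prototype: COVID-19 deaths by group")
fig.tight_layout()
fig.savefig("prototype.png")
```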

We then reach out to our colleagues at Finsbury Glover Hering (FGH), a leading global strategic communications and public affairs consultancy, to help with web development, design, and placement on the website. FGH begins creating comprehensive layouts (comps) based on the type of data and previous work at the CRC. Comps can be created quickly, allowing for multiple iterations. An example for the demographic data visualization is shown below.

[Figure: an example comp for the demographic data visualization]

The comp development and feedback process occurs in parallel with iterative reviews of the visualization produced by CCI. The entire CRC team is involved in the reviews to ensure accessibility, content clarity, labeling consistency, ideal placement on the site, and visual appeal. CCI focuses on data accuracy and telling the correct story, while FGH works on style, web compatibility, and the location on the website. When both teams have settled on a style and function that the entire CRC approves, FGH takes the visualization into staging to prepare it for release.

Doing Our Due Diligence

Throughout the design process, team members work on validating and maintaining all of our data streams, including the data for the new visualization. We have built anomaly detection systems for all data streams, in which our software compares new data against predefined thresholds that indicate when data points are outside expected ranges. These anomalies can occur for a variety of reasons: a state may release several days of data all at once (as often happens after a holiday weekend), retrospective reporting by states and counties may produce large data dumps, a source may change the structure or format of its data so that variables are misread, or there may simply be human error in data entry at the source. When data cannot be retrieved or certain data points fall outside the predetermined thresholds, the anomaly detection infrastructure is triggered and the questionable data is withheld.
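As a rough sketch of how threshold-based anomaly detection can work, the following compares day-over-day changes against per-field limits. The field names and thresholds here are hypothetical, not the CRC's actual configuration.

```python
# A rough sketch of threshold-based anomaly detection; field names and limits
# are hypothetical, not the CRC's actual configuration. Records that fall
# outside the expected range are withheld pending manual review.
from dataclasses import dataclass

@dataclass
class Threshold:
    field: str
    max_daily_increase: int  # hypothetical per-field ceiling

THRESHOLDS = [
    Threshold("cases", max_daily_increase=50_000),
    Threshold("deaths", max_daily_increase=1_000),
]

def flag_anomalies(previous: dict, new: dict) -> list:
    """Return fields whose day-over-day change is negative or above its ceiling."""
    flagged = []
    for t in THRESHOLDS:
        delta = new[t.field] - previous[t.field]
        if delta < 0 or delta > t.max_daily_increase:
            flagged.append(t.field)
    return flagged

# Example: a large post-holiday data dump trips the check, so the record is
# withheld until a human confirms or amends it.
yesterday = {"cases": 1_200_000, "deaths": 20_000}
today = {"cases": 1_290_000, "deaths": 20_150}  # +90,000 cases in one day

flagged = flag_anomalies(yesterday, today)
if flagged:
    print(f"Withholding record; anomalous fields: {flagged}")
else:
    print("Record passed; publish on the next hourly upload.")
```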

When anomaly detection triggers the withholding of data, we begin a manual investigation, seeking out additional sources, such as news articles, or reaching out directly to the organization providing the data. If the data needs to be amended, we do so. No anomalous data points are released without confirmation through manual reinvestigation. Data that passes anomaly detection gets uploaded to the CRC website every hour.

Additionally, when sources change their data formatting or dashboard design, our scrapers can no longer retrieve the data. Beyond causing anomalies in the data, this can also cause a scraper to “break,” alerting the team at APL through a separate data integrity checking structure. APL can then investigate the source and modify the scraper or provide the updated data manually while scraper development begins anew.
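A separate integrity check of this kind might look roughly like the sketch below; the source URL, expected columns, and alerting behavior are assumptions for illustration, not the CRC's actual infrastructure.

```python
# A sketch of a separate integrity check that detects a "broken" scraper;
# the URL and expected columns are assumptions for illustration. A failed
# fetch or a missing column (e.g., after a dashboard redesign) raises an alert.
import requests

SOURCE_URL = "https://example.gov/covid-data.csv"        # placeholder source
EXPECTED_COLUMNS = {"date", "region", "cases", "deaths"}

def scraper_is_healthy(url: str) -> bool:
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        return False  # source unreachable: alert the team
    header = resp.text.splitlines()[0] if resp.text else ""
    columns = {c.strip().lower() for c in header.split(",")}
    return EXPECTED_COLUMNS <= columns  # missing columns imply a format change

if not scraper_is_healthy(SOURCE_URL):
    print("Scraper broken: investigate the source, patch the scraper, "
          "or fall back to manual collection.")
```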

Release and Reception

When the data and the visualization are both ready for the public, we release the new visualization to the CRC website after approval from our entire executive team. The work is not done yet, though, as we then enter a perpetual maintenance phase. Every visualization on the CRC undergoes a daily health check. This allows us to identify cases where the site is not updating, which can occur, for example, when a source does not release new data for a given day. Our team manually checks these errors and confirms that our systems are still working properly. If certain anomalies cannot be resolved, we develop a data note to explain the situation, which is posted on the CRC and/or in the public GitHub repository. We remain available for inquiries from the media and public while working on the next datasets and visualizations in the pipeline.
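A daily health check along these lines can be as simple as comparing each visualization's last successful update against a staleness window. The visualization names, timestamps, and window below are hypothetical.

```python
# A sketch of a daily health check, with hypothetical visualization names,
# timestamps, and staleness window: flag any visualization whose last
# successful update is older than expected so a team member can inspect it.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=36)  # hypothetical window; sources often skip a day

# Last successful update per visualization (placeholder values).
last_updated = {
    "us-map": datetime(2021, 9, 13, 22, 0, tzinfo=timezone.utc),
    "demographic-breakdown": datetime(2021, 9, 10, 6, 0, tzinfo=timezone.utc),
}

now = datetime(2021, 9, 14, 12, 0, tzinfo=timezone.utc)
for name, ts in last_updated.items():
    if now - ts > STALE_AFTER:
        print(f"Health check: '{name}' has not updated in {now - ts}; review manually.")
```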

[Figure: the expanded demographic data visualization as released on the CRC site]

What we do would not be possible without our enormous, dedicated team with diverse backgrounds in public health, data science, computer science, communications, graphic design, and more. We are proud of the work we have been able to create and share with the public thus far, and we will continue to bring you all the pandemic data we can. But, critically, this work remains possible only if states and health organizations continue to provide publicly accessible data.

Beth Blauer, Associate Vice Provost, JHU

Beth Blauer is the Associate Vice Provost for Public Sector Innovation and Executive Director of the Centers for Civic Impact at Johns Hopkins. Blauer and her team transform raw COVID-19 data into clear and compelling visualizations that help policymakers and the public understand the pandemic and make evidence-based decisions about health and safety.

Lauren Gardner, Associate Professor, WSE, JHU

Dr. Lauren Gardner and her team at the CSSE built the COVID-19 global tracking map in January 2020, creating the most comprehensive publicly available data set on the pandemic. Dr. Gardner’s data drives much of the CRC’s analysis and serves as a vital resource for millions of users to track the pandemic.