Pandemic Data Outlook

Preserving the Life Cycle of Pandemic Data

The importance of public health data will not end after this pandemic, even when state dashboards shut down and COVID-19 no longer dominates headlines. We need to start preparing to preserve these data long-term, ensuring public access and utility.

Beth Blauer, Associate Vice Provost, JHU
November 8, 2021

Journalists, researchers, and members of the public routinely contact the CRC to obtain historical data points and trends about the COVID-19 pandemic. Whenever possible, and when we have the bandwidth, we try to assist. As the months unfold, we continually ask ourselves whether archiving these data and making them available to the world in perpetuity should be the role of one private institution. One day the pandemic will end, and people will still need access to these data for research, record keeping, and preparation for the next crisis. So where should the data go? Storing and managing them will require a host of new policies and regulations, and as these rules are developed, we need to ensure data governance, longevity, accessibility, and utility.

Storing Something with No Expiration Date

Faculty members are known for hoarding towers of cardboard boxes and stacks of unlabeled hard drives containing the data from the lifespans of their labs. A few of them can navigate that chaos impressively, but even then, what happens when they retire? Are those data gone forever? University libraries have stepped in to fill that need for academic researchers, as previously discussed with Dr. Sayeed Choudhury, and we need similar data management efforts for national public health data. State and local health departments are already overburdened, making this an opportunity for the federal government to step up and provide data management services for all public health data.

The Centers for Disease Control and Prevention serves as the official recordkeeper for national health data, and it is adept at storing the data that states voluntarily provide. However, the quantity and granularity of the data that the federal agency receives from states are lacking. Demographic data are often inconsistent or missing, and even basic data on cases, tests, hospitalizations, and deaths are incomplete [1]. At minimum, the federal government could use some of its funding authority to enforce better data collection and reporting. This should be coupled with policy and investment to help state and local health departments meet these requirements. However, we are now almost two years into this pandemic, and federal agencies have taken little action on state data collection. They have not even introduced firm standards for COVID-19 data, something seemingly within their purview.

The CDC has experience handling data storage and access through resources like the Wide-ranging ONline Data for Epidemiologic Research (WONDER) and the National Prevention Information Network (NPIN). But these programs will need major expansion, investment, and updating to make them as complete and functional as the many private and state-run COVID-19 databases. We at the CRC are committed to archiving everything we've collected, but we haven't had access to all data, and we are not as well equipped as federal agencies to store data long-term or to manage access to them over many years and at full detail. States cannot be expected to store that much data for every disease on an ongoing basis as storage needs, and the costs that come with them, keep growing. A strong public health data infrastructure will require large federal data repositories.

Data at Your Fingertips

Once public health data are stored properly, everyone must be able to access them. There are already petabytes' worth of privately owned data out there. Electronic health records in hospital systems contain incredibly useful health data that are protected behind firewalls. And Facebook and Google mine a tremendous amount of data from each of us every day to better sell us products and increase our usage of their programs [2, 3]. We do not need more private data in the world. Public health data needed to be publicly available during this pandemic, and they need to be publicly available afterwards.

As strong proponents of public data, we believe that accessibility should not and cannot end when the pandemic does. In fact, due to copyleft and open data licensing around some pandemic data streams, COVID data will need to be fully available to the public. Few people are going to drive to the CDC or National Institutes of Health offices to sift through historic data, so it needs to be digitally accessible through open access databases. Whatever entity houses the data will need to set up infrastructure to make the data accessible to all who need it, without a paywall or security clearance.

Knowing What You’re Looking At

Once data are preserved and accessible, they need to be stored in a comprehensible, useful format. Data are often complicated and confusing because of the specialized parlance used by the scientists collecting them, nuances in recording and analysis that vary between users, or intentional miscommunication meant to prevent others from following the same line of research [4]. Two researchers in the same lab can rarely read and understand each other's Excel files, so it would be unreasonable to assume that even lauded researchers will be able to understand the plethora of data to come out of this pandemic without detailed explanation or access to the code behind the data analysis.

Making the data available to all is not the same as making them useful to all. To be understood by other interested parties, data must be accompanied by the codes, keys, legends, and software used to generate and analyze them. This is a major component of data governance and must be considered in all data policy crafted in response to the pandemic. As Dr. Choudhury asserted, the story of the data is almost as important as the data themselves, and we need to preserve it as well. Use of open source software is one method of preserving data in a useful and accessible manner, but, as of now, states are not utilizing open source software to manage and display their COVID-19 data. Regardless, any plan for data storage must give significant consideration to preserving and providing the analysis code along with the data.
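To make the idea of keeping the key with the data concrete, here is a minimal Python sketch of one way an archive might bundle a dataset with its data dictionary and a checksum. It is an illustration under stated assumptions, not a description of how the CRC or any agency stores data: the column names, the `build_bundle` function, the manifest layout, and the sample rows are all hypothetical, and the counts are made up.

```python
import csv
import hashlib
import io
import json

# Hypothetical example records; values are invented for illustration only.
ROWS = [
    {"date": "2021-10-31", "state": "MD", "new_cases": 812},
    {"date": "2021-10-31", "state": "VA", "new_cases": 1204},
]

# The data dictionary plays the role of the "key" or "legend" for each column.
DATA_DICTIONARY = {
    "date": "Report date, ISO 8601 (YYYY-MM-DD)",
    "state": "Two-letter US postal abbreviation",
    "new_cases": "Newly reported confirmed cases on the report date",
}


def build_bundle(rows, dictionary, source_note):
    """Return (csv_text, manifest_json) so the data travel with their description."""
    columns = list(rows[0].keys())

    # Refuse to archive any column that nobody has explained.
    missing = [c for c in columns if c not in dictionary]
    if missing:
        raise ValueError(f"columns lack a description: {missing}")

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    csv_text = buffer.getvalue()

    manifest = {
        "columns": {c: dictionary[c] for c in columns},
        "row_count": len(rows),
        # The checksum lets a future reader confirm the file is unchanged.
        "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
        "source_note": source_note,
    }
    return csv_text, json.dumps(manifest, indent=2)


csv_text, manifest_json = build_bundle(
    ROWS, DATA_DICTIONARY, "Illustrative data, not real counts"
)
```

The design choice worth noting is that the function fails rather than silently archiving an undocumented column, which is one simple way to enforce that the story of the data is stored alongside the data.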

Data governance does not simply center around generating data. We need to consider what happens to the data after they are collected, and set up appropriate storage systems now. Public health data are essential to our future, and we must preserve them. Policymakers need to start addressing this before data begin disappearing [5], giving particular attention to data longevity, accessibility, and utility.


  1. A. Fast, Millions Of People Are Missing From CDC COVID Data As States Fail To Report Cases, 01 September 2021. (Accessed 31 October 2021).
  2. C. Zakrzewski, C. Lima, E. Dwoskin, W. Oremus, Facebook whistleblower Frances Haugen tells lawmakers that meaningful reform is necessary ‘for our common good’, The Washington Post, 5 October 2021.
  3. A. Vittorio, Google Defends Sharing User Data in Ad-Targeting Class Action, 04 October 2021. (Accessed 31 October 2021).
  4. P. Glasziou, D.G. Altman, P. Bossuyt, I. Boutron, M. Clarke, S. Julious, S. Michie, D. Moher, E. Wager, Reducing waste from incomplete or unusable reports of biomedical research, The Lancet 383(9913) (2014) 267-276.
  5. K. Berg, J. Cohn, D. D’Amora, C. D’Angelo, M. Hobbes, A. Kaufman, E. Peck, K. Sheppard, S. Subramanian, Data Disappeared. (Accessed 31 October 2021).

Beth Blauer is the Associate Vice Provost for Public Sector Innovation and Executive Director of the Centers for Civic Impact at Johns Hopkins. Blauer and her team transform raw COVID-19 data into clear and compelling visualizations that help policymakers and the public understand the pandemic and make evidence-based decisions about health and safety.