The Coronavirus Resource Center is composed of an extremely talented, diverse team that is dedicated to providing the best public data available. Our multi-step data visualization process is highly iterative, allowing for input at all levels before releasing data to our viewers. This commitment to high-quality public data should be modeled moving forward.
The Coronavirus Resource Center (CRC) strives to consistently provide the best public data on COVID-19 despite less frequent reporting and dashboard closures at the state level. As we continue to advocate for public health data, and in the spirit of transparency, we would like to show you how we find, process, store, and report data. As a reminder, all data the CRC uses (apart from census, hospitalization, and variant data) is available in our public CCI and CSSE GitHub repositories, located here and here, respectively.
As every clichéd business story will tell you, it all starts with an idea. We receive regular input from the public, news media, and our own team with ideas for new, informative data visualizations or for expansions and updates to existing ones. One great example is our Racial Data Transparency map. When that visualization launched, we could show only which states provided data disaggregated by race (below).
Based on feedback from our team and interest from the public, we worked to expand this visualization to include the actual proportions of testing, cases, deaths, and vaccinations broken down by race, ethnicity, sex, and age. This new visualization (accessed here and shown at the end of this blog) is a major improvement that required contributions from every member of the CRC team. How did we do it?
Once we have an idea, our expert data team at the Johns Hopkins Applied Physics Laboratory (APL) must determine if the necessary data exist. Team members prioritize investigating primary sources, such as governments and departments of health. To find the most complete and up-to-date data, this sometimes means turning to less traditional sources, such as official government social media accounts (e.g., Facebook), as is the case for the U.S. Virgin Islands.
After identifying a reliable source for the raw data, APL begins to collect data manually while simultaneously developing a data scraper to automate future data collection from the source (if possible). Scrapers are evaluated by performing manual data collection alongside scraping for a few weeks and comparing the results to ensure they match. Once the scraper is approved, it is set up to automatically collect data from the source every 30 minutes. Because the way data is provided publicly varies widely across sources, it is not always possible to automate data collection. For some data streams, such as demographic information and state policies, the data must be collected manually, and this is done by the team at the Centers for Civic Impact (CCI). These manual data collection processes are highly labor intensive and thus cannot be conducted as frequently as the automated data scraping; the demographic data, for example, is updated every two weeks.
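The side-by-side evaluation described above can be pictured as a simple comparison of two collections of the same figures. The following is only a minimal sketch, not the APL team's actual tooling: the record shape, the function name `compare_collections`, and the sample numbers are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: (date, region) -> cumulative case count.
Counts = Dict[Tuple[str, str], int]


def compare_collections(manual: Counts, scraped: Counts) -> List[str]:
    """List every disagreement between manual and scraped collections."""
    problems = []
    for key in sorted(set(manual) | set(scraped)):
        if key not in scraped:
            problems.append(f"{key}: missing from scraper output")
        elif key not in manual:
            problems.append(f"{key}: missing from manual collection")
        elif manual[key] != scraped[key]:
            problems.append(
                f"{key}: manual={manual[key]} scraped={scraped[key]}"
            )
    return problems


# Made-up sample values, not real CRC data.
manual = {("2021-07-01", "Region A"): 100, ("2021-07-02", "Region A"): 105}
scraped = {("2021-07-01", "Region A"): 100, ("2021-07-02", "Region A"): 150}

for line in compare_collections(manual, scraped):
    print(line)
```

In a sketch like this, an empty result over the whole evaluation period would be the signal that the scraper matches manual collection.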
We then reach out to our colleagues at Finsbury Glover Hering (FGH), a leading global strategic communications and public affairs consultancy, to help with web development, design, and placement on the website. FGH begins creating comprehensive layouts (comps) based on the type of data and previous work at the CRC. Comps can be created quickly, which allows for multiple iterations. An example for the demographic data visualization is shown below.
The comp development and feedback process occurs in parallel with iterative reviews of the visualization produced by CCI. The entire CRC team is involved in the review process to best ensure accessibility, content clarity, labeling consistency, ideal placement on the site, and visual appeal. CCI focuses on data accuracy and telling the correct story, while FGH works on style, web compatibility, and the location on the website. When both teams have settled on a style and function that the entire CRC approves of, FGH can then take the visualization into staging to prepare it for release.
Throughout the design process, team members work on validating and maintaining all of our data streams, including the data for the new visualization. We have built anomaly detection systems for all data streams, in which our software compares new data against predefined thresholds that indicate when data points are outside expected ranges. These anomalies can occur for a variety of reasons: a state releases data from multiple days all at once, as often happens after a holiday weekend; states and counties report retrospectively, resulting in large data dumps; the source changes the structure or format of the data, causing variables to be misread; or there is simply human error in the data entry at the source. When data cannot be retrieved or certain data points are outside predetermined thresholds, the anomaly detection infrastructure is triggered and the questionable data is withheld.
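To make the threshold idea concrete, here is a minimal sketch of such a check for a single cumulative count. It is an illustration under stated assumptions rather than the CRC's real system: the `Thresholds` class, the `check_point` function, and the specific limit of 500 are hypothetical, and the actual thresholds are not described in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    # Hypothetical upper bound on a plausible one-update increase.
    max_increase: int
    # Cumulative counts normally never fall, so decreases are suspicious.
    allow_decrease: bool = False


def check_point(previous: int, new: Optional[int], limits: Thresholds) -> str:
    """Return 'release' or a 'withhold: ...' reason for one new data point."""
    if new is None:
        return "withhold: data could not be retrieved"
    change = new - previous
    if change < 0 and not limits.allow_decrease:
        return "withhold: cumulative count decreased"
    if change > limits.max_increase:
        return "withhold: increase exceeds threshold"
    return "release"


limits = Thresholds(max_increase=500)  # illustrative value only
print(check_point(1000, 1040, limits))
print(check_point(1000, 2600, limits))
print(check_point(1000, None, limits))
```

A post-holiday data dump would trip the second branch in this sketch, which is the kind of case that then goes to manual investigation.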
When anomaly detection triggers the withholding of data, we begin a manual investigation, seeking out additional sources, such as news articles, or reaching out directly to the organization providing the data. If the data needs to be amended, we do so. No anomalous data points are released without confirmation through manual reinvestigation. Data that passes anomaly detection gets uploaded to the CRC website every hour.
Additionally, when sources change their data formatting or dashboard design, our scrapers may no longer be able to retrieve the data. Beyond causing anomalies in the data, this can also cause a scraper to “break,” alerting the team at APL through a separate data integrity checking structure. APL can then investigate the source and modify the scraper or provide the updated data manually while scraper development begins anew.
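One simple way a format change can be caught is by checking that a source still exposes the fields a scraper expects. The sketch below assumes, purely for illustration, a source that publishes CSV with a header row; the column names and the `missing_columns` helper are hypothetical and do not describe APL's actual integrity checks.

```python
import csv
import io
from typing import List, Sequence

# Hypothetical columns a scraper might depend on.
EXPECTED_COLUMNS = ("date", "county", "cases", "deaths")


def missing_columns(
    csv_text: str, expected: Sequence[str] = EXPECTED_COLUMNS
) -> List[str]:
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty download yields an empty header
    found = {name.strip().lower() for name in header}
    return [name for name in expected if name not in found]


# The source renamed "cases" to "Confirmed": the check reports the gap.
print(missing_columns("Date,County,Confirmed,Deaths\n2021-07-01,X,1,0\n"))
```

A non-empty result would be a reason to alert the team instead of silently ingesting misread values.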
When the data and the visualization are both ready for the public, we release the new visualization to the CRC website after approval from our entire executive team. The work is not done yet, though, as we enter a maintenance phase that continues in perpetuity. Every visualization on the CRC undergoes a daily health check. This allows us to identify errors where the site is not updating, which occur, for example, when the source did not release new data for a given day. Our team manually checks these errors and confirms that our systems are still working properly. If certain anomalies cannot be resolved, we develop a data note to explain the situation, which is posted on the CRC and/or in the public GitHub repository. We remain available for inquiries from the media and public while working on the next datasets and visualizations in the pipeline.
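A daily health check of this kind can be thought of as a staleness test on each visualization's last update time. The following is a hedged sketch only: the function `stale_visualizations`, the visualization names, and the 24-hour window are assumptions made for illustration, and a real check would likely use a different window for data that is updated less often.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_visualizations(
    last_updated: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # hypothetical window
) -> List[str]:
    """Return names of visualizations not updated within max_age."""
    return sorted(
        name for name, stamp in last_updated.items() if now - stamp > max_age
    )


# Made-up timestamps for two hypothetical visualizations.
now = datetime(2021, 7, 2, 12, 0)
updates = {
    "map-a": datetime(2021, 7, 2, 11, 0),
    "map-b": datetime(2021, 6, 30, 9, 0),
}
print(stale_visualizations(updates, now))
```

Anything this sketch flags would still need the manual follow-up described above, since a stale timestamp can simply mean the source published nothing new that day.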
What we do would not be possible without our enormous, dedicated team with diverse backgrounds in public health, data science, computer science, communications, graphic design, and more. We are proud of the work we have been able to create and share with the public thus far, and will continue to help inform you of all the pandemic data we can. But, critically, this work only remains possible if states and health organizations continue to provide publicly accessible data.