Open source software licenses allow data analyses to be shared, reviewed, and reproduced in a quick, consistent, and clear manner. This can allow researchers to circumvent complicated data use agreements and push the public health research community toward more data sharing and collaboration in a post-pandemic world.
One of the most common arguments against data sharing and collaboration is that it is too time-consuming, labor-intensive, and low-yielding. Academics can spend months generating legal documents known as data use agreements (DUAs) to share their data within their own institutions or with external groups, only to discover, once they have access to the dataset, that it does not actually contain the data they wanted.1 The legal process becomes more complicated with each additional party added to the DUA. Others claim that the necessary raw data become available once research is published, which is often, but not always, true. Data included with a publication have often been cleaned up and trimmed down to the least comprehensive and least clear picture allowable,2, 3 possibly to look professional or to prevent others from scooping future research trajectories.
As strong proponents of public availability and consumption of data, we believe that the combination of open data licensing and open source software may be the best path forward to improve the utility and accessibility of data and increase collaboration across institutions, organizations, and even countries, all while protecting privacy and individual research project ownership. According to Opensource.org, open source software refers to any software that can be freely accessed, used, changed, and shared (in modified or unmodified form) by anyone, whereas Creative Commons states that open data is data that can be freely used, re-used and redistributed by anyone. These are not new concepts: much of today’s household technology was built on open source software, as were web servers and browsers, video and audio conferencing services, and the operating systems on smartphones and other home and office devices.4
Open licensing allows data scientists to share the data, code, and methodologies for their analysis without the need to establish formal DUAs, because the license itself already sets the terms of use for anyone who uses the material. Some licenses, known as “copyleft” licenses, even require that work created with open software or data be open as well. As with all licenses, each open source license has nuances and complications that are outside the scope of a brief blog post, but there are licenses designed for all purposes that allow data, code, and software to be shared freely, for commercial use, and/or with the stipulation of credit or acknowledgement. The more open the license, the more useful the data are to the public and the larger research community.
At the inception of the COVID-19 global map, and throughout the later work of the Coronavirus Resource Center (CRC), there were many discussions about licensing and accessibility, since public data is at the core of our mission. The original U.S. and global maps are constructed within the ESRI ArcGIS software platform, which was chosen in part because it is widely used and allowed our team to respond quickly at the beginning of the pandemic. An additional benefit is that the platform gives others an easy way to use our data layers within their own platforms. ESRI is considered an open platform but not open source software, simply because you need to purchase an ESRI license before you can access all the data layers and projects. However, we made sure to go further and publish our raw data in our CSSE GitHub repository, covered under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows free use of the data as long as we are cited.
The other visualizations and analyses performed on the CRC website, such as the Demographics of COVID maps, are usually crafted using Plotly or D3, both of which are open source software. This means that anyone can download Plotly or D3, and then download the code we wrote to build these visualizations. This allows anyone anywhere in the world to replicate and scrutinize our analyses. Without open source software, concerned individuals and research groups would need to reach out to us to establish a DUA every time they wanted to use our code, replicate our visualizations, or repeat or extend the analysis. Given the volume of traffic to the CRC, that would not be feasible. With open source licensing, our work has been used hundreds of times across countries and platforms, allowing Johns Hopkins to become one of the most trusted names in pandemic data.
What if we took this approach for all public health data? The purpose of public health data and research should be to improve individual and societal well-being. Proprietary datasets locked behind paywalls and legal red tape rarely reach their full potential to help society, even when researchers want to share their data publicly. Universities are now stepping up to bridge this gap by establishing Open Source Program Offices (OSPOs) to help manage and maintain open source software licensing for data and code created at academic institutions.5 Through OSPOs, academic research can become more accessible and trustworthy, as a global community of developers and users will have the ability to review and analyze it.
This concept could extend past academia to the greater community of data experts as this country and the world seek to rebuild and redesign their data systems following the strain put on them by the COVID-19 pandemic. Policymakers and funding agencies would not even have to endorse a specific open source software package. As long as research is performed or funded on the condition that it use open source software, the data and analysis can be easily accessible to all. Additionally, if the code or program a researcher needs does not already exist and they develop one, academic OSPOs now exist to establish the new software as open source for the greater data community to use.
We have been trying to work our way through the bureaucratic nightmare that is open data in a proprietary world. Open source software has the potential to circumvent all of the legal and technical complications created by the current system. By using open source software, researchers show that they are committed to the ethical and scientifically sound ideal of public data. The public is better served by public data, and wider use of open source software in a post-pandemic world can get us there.
For more information on how the libraries support researchers through OSPOs and other data services, please read our recent Q&A with Dr. Sayeed Choudhury.
Title image by Elchinator at Pixabay.