Expert Insight

Q&A: Library Data Management Leads Research Out of a Digital Dark Age

The Sheridan Libraries’ crucial role in the Johns Hopkins COVID-19 global map highlights the importance that libraries play in the data missions of universities. Library data management services reduce burdens on researchers, improve public access to data, foster collaborations, and maintain the long-term integrity of datasets.

Joshua E. Porterfield, PhD
October 27, 2021

University libraries aren’t what they used to be. Most are going far beyond print media collections and research article hosting services. The Sheridan Libraries have been a vital partner in buttressing the Johns Hopkins COVID-19 global map, demonstrating that libraries have emerged as primary stewards of data that continues to expand in scale and complexity, says Dr. Sayeed Choudhury, Associate Dean for Research Data Management at the Sheridan Libraries. The library provides software services and training, data storage and archiving, data management, technical support, and public-facing customer service, removing many burdens from researchers.

Can you describe the role the library plays in supporting the global COVID-19 map?

The library first became involved because we provide the support for the institution’s license with ESRI, the vendor of the software used for the COVID-19 dashboard. The library had to scale up that service and the use of that technology by amending the license. ESRI’s software was never designed for this scale, so we had to upgrade the infrastructure on our side and determine what implications that might have in terms of supporting it. It was unprecedented. It's almost like someone had a small research lab, and then suddenly it was run on an Amazon-like scale. Making that jump in scale is a very important part of what we did.

“I work with really great people. This phenomenon was all new to them, but they dove right in and made a huge impact.”

Early on, we also provided the hosting services for the scripts and the tools that were being used to process the data. The team at the Applied Physics Laboratory has been phenomenal to work with, but they were not initially set up to work with GitHub and the cloud in the open. The library picked up the initial hosting services, servers, and infrastructure to get the site running. As with any large-scale data processing or software engineering activity, in the early days we made mistakes, learned, and evolved. The library was also on-call 24/7 to handle scripts breaking or pipelines shutting down.

Our daily engagement has waned, and now we are thinking about what it means to build an archive around this endeavor. The library is a good place to keep data for the long term. Libraries have gone beyond just depositing data for later review, to thinking of all the ways people use the data, tools, visualizations, services, and programs essential to the research. It isn't just keeping the data in GitHub or taking the data out of GitHub and putting it in a library archive. It's about capturing the story of what happened. Sadly, this will not be the last pandemic, so we need to create this archive in a way that can best support the next team that takes the lead in providing this expertise, advice, and guidance for the world.

“Data has become such an important piece of a bigger story. It's a substrate that ties together so many different uses, people, and activities.”

How do you factor in public accessibility when archiving data?

The key for the library is becoming a clearinghouse, not a gatekeeper, of public data. We have a data archive that's public, so once data are there, people can download them, cite them, point to them in grant applications, etc. There's no reason why that can't be open to everyone. Once data are made public, people will have questions and want answers. If the public becomes very interested in something, they should contact us. We can answer baseline questions and then forward pertinent questions back to the researchers. The library increasingly sees itself as playing that role of handling requests that come in from the public so that researchers are not overwhelmed and inundated when they make data public.

How do the library’s data management services support researchers?

There are projects coming to the end of their lives and researchers who have lots of data. Those data are not trivial for researchers to manage on their own. Ignoring the public aspect of these data, the library can help researchers share their data with themselves over time. If you ask professors if they can find and use their own lab’s data from five years ago, many of them can’t. Institutional support reduces the need for researchers to have that burden of carrying and recollecting all of their data over time.

This is where software is becoming so important. What many institutions and agencies are realizing now is that sharing data without the code necessary to interpret it is an incomplete approach. You have to have the code and the data in order to participate in the ongoing research dialog with peer review, counterarguments, and new discoveries. If you’re inviting the rest of the research community to build on your research and go forward as a community, we want to make sure other people can understand and use your data. The library can assemble and maintain the code, data, and metadata, serving as better stewards of data than filing cabinets, cardboard boxes, and unlabeled hard drives.

How do we connect systems to support data sharing across institutions?

One of the things I like about libraries is cooperation. There's absolutely this belief that libraries as a network are stronger by working together than if we compete with each other. Our belief has always been that our collective collections are an incredible resource, and we have to find ways to share them with each other. I think that'll happen with data just as it has with books through services like interlibrary lending. There are service and technology elements that may not exist with print materials.

“We are better together than we are individually. We're already starting to think about how to tap into each other's expertise and service.”

We are members of the Data Curation Network, which is about 13 institutions that share expertise. If the University of Minnesota gets a request to curate data where they don't have a local expert, but Hopkins does, we will share that information and knowledge. Then when Minnesota has expertise we don't, we can tap into it. We will need to be intentional and strategic about curating our data collection and expertise, just as we've been with our print collections. We're still in the early stages of depositing and managing our own researchers’ data long-term, but by doing that we're starting to understand Hopkins’ specific data needs. As other institutions do the same, I think we'll start mapping those profiles to each other.

Joshua E. Porterfield, PhD

Dr. Joshua E. Porterfield, Pandemic Data Initiative content lead, is a writer with the Centers for Civic Impact. He is using his PhD in Chemical and Biomolecular Engineering to give an informed perspective on public health data issues.