Expert Insight

Grassroots Fervor for Large-Scale Data Collaboration

Collaboration is key to scientific and medical research. However, many obstacles to collaboration on COVID-19 studies resulted in fragmented data collection and redundant studies asking the same questions.

Joshua E. Porterfield, PhD
July 21, 2021

Collaboration is one of the greatest strengths of the scientific community; however, the COVID-19 pandemic exposed multiple obstacles to it, from bureaucratic red tape to academic self-interest. Dr. Betsy Ogburn, a professor of biostatistics in the Bloomberg School of Public Health and founder of the COVID-19 Collaboration Platform, reflects on the academic and policy environments surrounding COVID-19 research and draws a road map for ensuring improved collaborative data efforts in the future.

How does data from clinical trials benefit from aggregation?

Aggregating data across clinical trials is important for several reasons. Many clinical trials are not powered for the question of interest, especially with novel medical treatments—or even a novel disease!—where you don’t know what kind of effect size you’re looking for. It's a general policy not to perform underpowered trials, but it happens all the time. So most clinical trials are too small to answer the questions they're trying to answer, and aggregating data across multiple trials increases the sample size and gives us more reliable information about the scientific questions of interest.
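To make the sample-size argument concrete, here is a minimal sketch in Python. It uses a textbook normal approximation for a two-arm comparison, and every number in it (a standardized effect size of 0.2, 50 patients per arm, ten trials pooled) is an illustrative assumption rather than a figure from any real COVID-19 trial. It also assumes the pooled trials are similar enough to be analyzed as one large trial, which real aggregation efforts would need to check.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-arm z-test.

    effect_size is the standardized difference between arms
    (hypothetical here); n_per_arm is the number of patients per arm.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect_size * sqrt(n_per_arm / 2)  # mean of the test statistic
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Illustrative assumption: one small trial with 50 patients per arm.
single_trial = approx_power(effect_size=0.2, n_per_arm=50)

# Illustrative assumption: ten such trials pooled, 500 patients per arm.
pooled = approx_power(effect_size=0.2, n_per_arm=500)

print(f"single trial power: {single_trial:.2f}")
print(f"pooled power:       {pooled:.2f}")
```

Under these assumptions the single small trial would usually miss a real effect, while the pooled data would usually detect it, which is the point being made about underpowered trials.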

The second reason is that aggregating data across trials that are answering the same scientific question reduces the probability of false positives. The global scientific community was trying to answer the same questions at the same time, especially at the beginning of the pandemic. At one point there were dozens of trials across the globe that were all trying to figure out if hydroxychloroquine was an effective treatment for COVID-19. With standard statistics, even if each one of those trials were adequately powered, you would expect five out of 100 trials to give you a false positive, and those are the studies that are going to be published while the other 95 “failed” trials are relegated to a file drawer. Analyzing these parallel trials in tandem reduces the possibility of false positives.
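The arithmetic behind the "five out of 100" figure can be sketched as follows. This is a simplified illustration that assumes a conventional 5% significance threshold, trials that are independent of one another, and a treatment with no real effect; the function name is hypothetical.

```python
def false_positive_summary(n_trials, alpha=0.05):
    """Return the expected number of false positives and the chance
    of at least one, assuming independent trials of an ineffective
    treatment, each tested at significance level alpha."""
    expected = n_trials * alpha
    p_at_least_one = 1 - (1 - alpha) ** n_trials
    return expected, p_at_least_one


# Illustrative assumption: 100 parallel trials of the same treatment.
expected, p_at_least_one = false_positive_summary(100)

print(f"expected false positives: {expected:.1f}")
print(f"chance of at least one:   {p_at_least_one:.3f}")
```

With 100 independent trials, at least one false positive is close to certain under these assumptions, so a literature built only from the "successful" trials would be misleading. Analyzing the trials together replaces many separate tests with a single pooled one.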

“If you aggregate data, you are more likely to be able to support detailed questions and conclusions.”

Another benefit of aggregating data across many trials is that you can get a more diverse and representative sample. With aggregated data from many trials, you will likely be able to generalize conclusions to more populations of interest and ask more refined scientific questions to drill down into demographic or disease-related subgroups. Researchers often try to drill down using data from a single trial, but it’s rare that a single trial would have adequate sample size for such analyses and there’s always the risk of fishing for signals and finding false positives.

What barriers to robust data sharing did the COVID-19 Collaboration Platform identify?

I stand by the idea behind the COVID-19 Collaboration Platform, but we faced so many challenges getting researchers to share their data. We have aggregated data for three research questions, which is a success, but not nearly the impact I had hoped for. From that work, I have identified three major barriers to collaboration: logistical hurdles, effort conservation, and perverse incentives to not share data. Logistical hurdles encompass the bureaucratic red tape at all institutional and government levels. Even without that red tape, it would still take extra effort to share data from all COVID-19 clinical trials, and researchers were spread too thin—especially during the early months of the pandemic, when the PIs of many clinical trials were also the front-line medical workers treating COVID patients. Those two issues are somewhat surmountable in the future.

However, incentives in the academic research setting align against sharing data. Since the most valuable research output is a first-author paper, it's self-sacrificial to forgo that authorship for the benefit of public health and contribute data to a bigger project where you are one of maybe hundreds of authors. That sacrifice has ripple effects through researchers’ careers, especially young researchers. It makes it harder to get grants to fund your research in the future if you don't go after first-author publications. It makes it harder to get your future papers published. It makes it harder to get promoted within your institution.

“In public health, the core purpose of research is to save lives and the incentives are aligned against that very outcome.”

How do we fix the system to promote collaboration and improved research?

It's a collective action problem that I am not equipped to solve. I think funding organizations should change the way they evaluate the outputs from research. It would be pretty easy to create systems that evaluate funded research on how much it contributed to public health, as opposed to how much it contributed to first-author publications. The government’s answer to centralized clinical trial aggregation does not go far enough in requiring funded research to be made publicly available, especially for time-sensitive medical, clinical, and public health questions.

Data should be shared for the public good, especially if its collection was publicly funded. But incentive structures need to change so researchers can be confident that their careers won't suffer if they share their data instead of hoarding it for first-author papers.

Have there been any successes with data collection and aggregation during the pandemic?

There were so many grassroots successes, including the Johns Hopkins Coronavirus Resource Center. There were academics who worked to get mobility data from tech companies and made it available for research. There were many spur-of-the-moment collaborations between municipalities, counties, and research groups. I know anecdotally of a group of researchers who didn’t sleep for a week so they could build a predictive model for hospital capacity for a county Department of Health. Most of the failures are at a higher level, with the bodies that should be in charge of making sure that disparate research efforts coordinate with one another to get the general population the best possible information. As we look to the future, my feeling is that there needs to be a top-down effort to coordinate research for global public health emergencies; bottom-up efforts won’t cut it.

Joshua E. Porterfield, PhD

Dr. Joshua E. Porterfield, Pandemic Data Initiative content lead, is a writer with the Centers for Civic Impact. He is using his PhD in Chemical and Biomolecular Engineering to give an informed perspective on public health data issues.