UNIVERSITY PARK, Pa. — The U.S. National Science Foundation (NSF) National Synthesis Center for Emergence in the Molecular and Cellular Sciences (NCEMS) at Penn State will “bring scientists together from different disciplines to integrate diverse data sets to answer transformative scientific questions,” according to Justin Petucci, associate director of NCEMS and lead of the artificial intelligence and machine learning team at Research Innovations with Scientists and Engineers (RISE).
The center, which was announced in April, is supported by a $20 million grant from NSF, and will be housed at University Park and managed by Penn State.
The core mission of a synthesis center like NCEMS is to reuse existing publicly available data, rather than generate new experimental data. And with over 100 petabytes of data publicly available across the globe, the opportunities for discovery are many. NCEMS was inaugurated in May and has 11 members on its leadership team.
“There’s a massive amount of molecular and cellular data available and open questions that have the opportunity to be addressed applying computational and data science techniques to large integrated datasets,” Petucci said. “The center will build a nationwide community, including the formation of working groups consisting of scientists from different disciplines that will be supported by NCEMS.”
To support community-scale synthesis research, meaning research projects beyond the capabilities of individual labs to carry out, the team will address four main barriers to progress, according to NCEMS Director Ed O’Brien.
“The leadership team, in combination with the RISE team of engineers at the Institute for Computational and Data Sciences (ICDS), will address the data, methods, team and collaborative challenges that hold back progress in synthesizing these diverse data sets,” said O’Brien.
According to O’Brien, professor of chemistry and ICDS co-hire, a key challenge in integrating this diverse data is the highly specialized working knowledge needed to process, analyze and interpret the data generated from different experimental techniques.
“Combining, for example, mass spectrometry data and next-generation sequencing data requires specialized workflows to go from raw to processed data,” O’Brien said. “You need trained experts to do that right. Individual labs often do not have the resources to handle such a wide diversity of data. NCEMS will centralize this expertise as a national resource to overcome this challenge.”
With so much diverse data to be integrated, the analysis requires statistical methods that correctly interpret the data and minimize the false positives that can easily arise when the data are handled incorrectly, according to O’Brien. Further, bringing machine learning and theoretical modeling to bear can be another barrier for individual labs.
“We are going to provide resources and training to graduate students and post-docs to use these diverse methods to carry out the research on these big data sets,” O’Brien said.
The center will also support catalyst meetings to explore and identify potential synthesis questions that could form the basis of a working group.
“We want to make transformative discoveries,” O’Brien said. “The best way to do this is to bring diverse scientific perspectives together to drive new ideas. We are providing resources to help teams form and go after these questions, and we are giving them the tools to research effectively. In an individual lab, scientists may not have access to a comparable network of scientists NCEMS will create.”
To overcome these networking and research challenges, the center will need collaborative infrastructure.
“The way we are addressing this is through CyVerse, an open science infrastructure platform funded by NSF that allows teams of scientists to share data and information in a uniform environment,” O’Brien said.
The University of Arizona’s CyVerse initiative is considered the “world’s largest publicly funded open-source cyber infrastructure for life sciences,” according to an article from Penn State News.
The ICDS RISE team will be involved in the working groups to ensure their needs are met. Post-docs, while independent, can also define their own projects and work within groups, get feedback and have access to resources, according to Petucci.
Other key components of NCEMS’s mission include developing innovative research and analytical strategies, testing novel organizational models with open science principles and training the future workforce.
“This is super exciting to be a part of… to be supporting this type of research,” Petucci said. “NSF only funded one of these centers and Penn State got it. It really is a privilege to grow this national resource that can answer foundational questions.”