HERSHEY, Pa.— One challenge researchers face is in accessing and analyzing big data, modern artificial intelligence (AI), and machine learning (ML) methodologies to answer important clinical research questions necessary for larger studies. Up until recently, large sets of data from biomedical and health research, such as electronic health records (EHR), were out of reach as there was no infrastructure that allowed researchers to interface with the data in a secure and seamless way.
Vasant Honavar, co-lead of Penn State Clinical and Translational Science Institute’s (CTSI’s) Informatics Core, and his team launched the Digital Collaboratory for Precision Health Research (DCPHR). The DCPHR combines the efforts of CTSI’s Informatics Core and the Center for Artificial Intelligence Foundations and Scientific Applications. Together with the Institute for Computational and Data Sciences (ICDS), the Social Science Research Institute, and the Health Spoke of the National Science Foundation’s Northeast Big Data Hub, the Digital Collaboratory gives researchers access to these large data sets through several discovery tools, along with the artificial intelligence and machine learning stacks needed to use them properly.
A key goal of the DCPHR was to get Penn State Health EHR data ready for data-intensive research. The data needed to be standardized to comply with a common data model so that they would be ready for multi-site, EHR-based studies. Additionally, the basic infrastructure for AI/ML-enabled research needed to be implemented so that researchers' access to and use of the data would be policy-compliant, reproducible, scalable and shareable.
Wenke Hwang, co-lead of Penn State CTSI Informatics Core, focuses on the development and curation of Penn State Health EHR data that conform to the data standards of the PCORnet Common Data Model. As part of a PCORI-funded team since 2015, he has worked extensively with research investigators and a technical team in Penn State Information Technology to create a clinical research data repository called “PaTH to Health” data. The repository splits all patient-level and encounter-level data into multiple tables using pseudo-identifiers in a HIPAA-compliant manner. The “PaTH to Health” data meet the national data standard and are harmonized with data from more than 70 PCORnet clinical sites. This data repository is refreshed regularly (currently every other week), checked for data quality quarterly, and has served as the data infrastructure supporting several successful multi-site grant applications.
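To make the idea of pseudo-identifiers concrete, here is a minimal Python sketch of how flat visit records could be split into a patient table and an encounter table that are linked only by derived identifiers. It is an illustration, not the method used for the “PaTH to Health” repository: the field names (`mrn`, `visit_id`, `birth_year`, `diagnosis`), the secret key and the choice of a keyed hash (HMAC-SHA256) are all assumptions made for the example, and a keyed hash on its own does not make a data set HIPAA-compliant.

```python
import hashlib
import hmac

# Hypothetical secret held by the data custodian; it is never released with the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudo_id(real_id: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudo-identifier from a real identifier using a keyed hash."""
    digest = hmac.new(key, real_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in this sketch


def split_tables(records):
    """Split flat visit records into a patient table and an encounter table."""
    patients = {}
    encounters = []
    for rec in records:
        pid = pseudo_id(rec["mrn"])
        # One row per patient, keyed by the pseudo-identifier only.
        patients[pid] = {"patient_pid": pid, "birth_year": rec["birth_year"]}
        # One row per encounter, linked back to the patient by pseudo-identifier.
        encounters.append({
            "encounter_pid": pseudo_id(rec["visit_id"]),
            "patient_pid": pid,
            "diagnosis": rec["diagnosis"],
        })
    return list(patients.values()), encounters


# Made-up example records (not real patient data).
records = [
    {"mrn": "A100", "visit_id": "V1", "birth_year": 1960, "diagnosis": "I10"},
    {"mrn": "A100", "visit_id": "V2", "birth_year": 1960, "diagnosis": "E11.9"},
    {"mrn": "B200", "visit_id": "V3", "birth_year": 1985, "diagnosis": "J45.909"},
]
patient_table, encounter_table = split_tables(records)
```

Because the same input always maps to the same pseudo-identifier, the two tables can still be joined for analysis, while neither table carries the original record number.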
In addition, Hwang is working to put the data in the hands of researchers in a timely and user-friendly manner. He has expanded the research repository to include several important domains that are not regularly covered within a common data model, such as pediatric data elements, mother-infant dyads, and neighborhood characteristics. He has worked with the investigators to use the repository for machine learning and predictive modeling for clinical and translational research. He has developed processes that link the PaTH to Health data repository with national mortality records, with prospective clinical data used in clinical decision support, and with non-AI/ML clinical research protocols including chart reviews and patient recruitment.
Enabling access to large data sets
The DCPHR currently provides access to data from two complementary sources. The first is de-identified Penn State Health EHR data standardized using the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) developed by the NIH-funded Observational Health Data Sciences and Informatics (OHDSI) consortium. The OHDSI consortium links 3,266 collaborators across approximately 75 countries with OMOP-based EHR data repositories that collectively contain 928 million unique patient records (representing about 12% of the world’s population). Any institution that is a member of the OHDSI consortium can propose a multi-site study with a defined study protocol and solicit participation from collaborators across the consortium. Each site that joins the study runs identical analyses on its own data, and the results are pooled to answer the research questions targeted by the study.
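The "same analysis at every site, then pool the results" pattern can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than OHDSI's actual tooling: each hypothetical site computes a log odds ratio from its own 2x2 table, and the coordinating center combines the site-level estimates with a fixed-effect, inverse-variance weighted average, so that only summary statistics leave each site. The counts are made up, and the function names are invented for the example.

```python
import math


def site_analysis(a, b, c, d):
    """Run the shared analysis on one site's 2x2 table.

    a: exposed with outcome, b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome.
    Returns (log odds ratio, variance). Assumes no cell is zero.
    """
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def pool_fixed_effect(site_results):
    """Combine site-level estimates with inverse-variance weights."""
    weights = [1 / var for _, var in site_results]
    pooled = sum(w * est for w, (est, _) in zip(weights, site_results)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# Made-up counts for three hypothetical sites.
site_tables = [(20, 80, 10, 90), (35, 165, 25, 175), (12, 48, 9, 51)]
site_results = [site_analysis(*table) for table in site_tables]
pooled_log_or, pooled_se = pool_fixed_effect(site_results)
pooled_or = math.exp(pooled_log_or)
```

Sites with more data produce estimates with smaller variance and therefore carry more weight in the pooled result. A real multi-site study would also need to check for differences between sites before relying on a fixed-effect summary.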
The second way to access EHR data is through TriNetX. TriNetX allows Penn State researchers to access Penn State Health EHR data, de-identified data from the TriNetX Research network (from over 71 health care organizations), de-identified claims data from the Diamond Network (from 92 organizations), and de-identified claims data from the COVID-19 Research Network (from 78 additional organizations). Researchers can define study cohorts of interest, by querying TriNetX networks based on medications, diagnoses, demographics, lab results, genomics, mortality, oncology, procedures, etc.
“Penn State is known for its broad and deep expertise in data science, artificial intelligence and informatics,” said Avnish Katoch, research informatics project manager with Penn State CTSI. “We needed a way to leverage the full expertise of Penn State to answer clinical research questions.”
“DCPHR aims to significantly lower the barriers to collaboration between clinical and translational scientists at Penn State College of Medicine, and data scientists and AI/ML experts at University Park and other campuses, for pursuing data-intensive, AI/ML-powered biomedical and health research,” said Honavar, Dorothy Foehr Huck and J. Lloyd Huck Chair in Biomedical Data Sciences and Artificial Intelligence and Director of Penn State Center for Artificial Intelligence Foundations and Scientific Applications.
The CTSI Informatics Core offers support on AI-powered, data-intensive health research
The CTSI Informatics Core empowers researchers in several ways. Not only do researchers get access to the Penn State instance of OMOP for pilot studies and access to the TriNetX system, they also can receive:
- Help with study design and feasibility analysis.
- Help with cohort definition and data extraction.
- Support for data preparation.
- Support for analysis of large data sets (characterization, prediction, effect estimation).
- Support for model interpretation, interrogation, deployment and inference.
- AI/ML support for research proposal development.
The CTSI Informatics Core partners with researchers to help with any aspect of AI/ML-based analyses of EHR, claims, or other large clinical data sets.
“If you’re a clinical researcher with a well-formulated clinical question that you’d like to answer using one of the data sets mentioned, we [the computational consulting team] are happy to collaborate. We have a high-performance computing infrastructure and the necessary software stack in place. We can retrieve the data, ingest and clean it, and run it through our AI/ML analytical pipelines,” said Justin Petucci, R&D Engineer, Research Innovation with Scientists and Engineers Team of ICDS, who lends his services to the CTSI Informatics Core.
Petucci has worked closely with Honavar, Katoch and clinical researchers on several projects, including:
- a multisite study of health disparities across different races using EHR data from 8 million patients in the United States;
- predicting the mortality attributable to cancer and non-cancer-specific causes using a cohort of over 1.4 million cancer patients from the US National Cancer Database;
- predicting the 30-day clinical outcomes for patients with COVID-19, with and without peripheral artery disease (PAD) using a large cohort of patients from TriNetX Research Network; and
- improving the accuracy of heart disease risk prediction using EHR data.
“Advances in AI, along with the increasing availability of large data sets, offer unprecedented opportunities to revolutionize biomedical and health research,” said Honavar. “Realizing the promise and potential of AI to improve individual and population health outcomes, inform health policy, and reduce health disparities, requires state-of-the-art data and computational infrastructure, advanced AI/ML expertise and tools, interdisciplinary collaboration between AI/ML experts and biomedical, clinical and health researchers, and eventually, training a new generation of AI/ML-savvy clinicians and clinical researchers,” he added.
Looking ahead
The Penn State CTSI Informatics Core is excited to begin moving DCPHR from pilot mode to production in order to support Penn State researchers interested in:
- AI/ML-powered, data-intensive biomedical, clinical and translational research;
- integrating other data (e.g., sociodemographic, environmental and eventually genomic data) with EHR and claims data; and
- fostering an interdisciplinary biomedical AI community at Penn State through workshops and ideas labs, in collaboration with CTSI, the Penn State Center for Artificial Intelligence Foundations and Scientific Applications, the College of Medicine’s Clinical Informatics Division and AI Initiative, and ICDS.
“I’m grateful for the leadership of our CTSI Informatics Core and their engagement of our outstanding collaborators,” said Jennifer Kraschnewski, director of Penn State CTSI. “Their important efforts have leveraged expertise across our university to bring the power of AI and ML to the opportunities with our EHR to take clinical and translational sciences into the future.”
Additional information
For more information and to submit a research services request through Penn State CTSI, visit the website.
Artificial intelligence and machine learning are necessary for researchers who work with large data sets; however, it can be challenging to understand how best to access and interface with these giant databases. Many research groups at Penn State are working through the CTSI Informatics Core to leverage data science methods to advance their work. For more information on how the CTSI Informatics Core works, watch this replay of “Harnessing the Power of EHR Data and AI to Advance Biomedical Research,” which covers how and why current research groups have applied artificial intelligence to their research and offers examples of how the computational consulting team can support your data science project.