Clinical and Translational Science Institute

Infrastructure, consultation for multi-site studies using electronic health data

NIH-funded Observational Health Data Sciences Initiative links over 3,000 collaborators across approximately 80 countries with electronic health records data repositories that collectively contain over 900 million unique patient records

CTSI Informatics and BERD Core Faculty. Left to right: Wenke Hwang, Vasant Honavar, Terrence Murphy. Credit: Penn State. Creative Commons

UNIVERSITY PARK, Pa. — Advanced machine learning methods can now help predict and understand health risks and outcomes. These methods use large sets of clinical data, including electronic health records, socio-demographic data and medical imaging. Until recently, Penn State researchers had limited access to big biomedical and health research data, such as electronic health records (EHR). This has now changed with the establishment of the Penn State Digital Collaboratory for Precision Health Research (DCPHR), an initiative led by the Penn State Clinical and Translational Science Institute (CTSI) and the Penn State Center for Artificial Intelligence Foundations and Scientific Applications (CENSAI).

DCPHR offers the infrastructure and research capacity to allow Penn State researchers to pursue collaborative data-intensive research projects using large clinical data sets and high-performance Artificial Intelligence/Machine Learning (AI/ML) data analytic workflows.

“DCPHR aims to significantly lower the barriers to collaboration between clinical and translational scientists at Penn State College of Medicine, and data scientists and AI/ML experts at other Penn State campuses, for pursuing data-intensive, AI/ML-powered biomedical and health research,” said Vasant Honavar, Dorothy Foehr Huck and J. Lloyd Huck Chair in Biomedical Data Sciences and Artificial Intelligence, director of CENSAI, and Penn State CTSI Informatics Core co-lead.

The DCPHR currently provides access to de-identified Penn State Health EHR data organized according to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM), which was developed by the NIH-funded Observational Health Data Sciences Initiative (OHDSI, pronounced “odyssey”) consortium. The OMOP common data model powers many large-scale biomedical data science efforts such as the NIH’s National COVID Cohort Collaborative (N3C).
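To illustrate what a common data model makes possible, the following is a minimal Python sketch, not the DCPHR implementation. It builds a small in-memory database with a reduced subset of two OMOP CDM tables (person and condition_occurrence) and counts the distinct patients who have a given condition. The rows, the function names and all concept IDs are placeholders invented for illustration, and real OMOP tables carry many more columns.

```python
import sqlite3

# Reduced subset of two OMOP CDM tables (assumption: only the columns
# needed for this example; the full tables have many more).
SCHEMA = """
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    gender_concept_id INTEGER,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # All concept IDs below are placeholders, not real vocabulary entries.
    conn.executemany(
        "INSERT INTO person VALUES (?, ?, ?)",
        [(1, 900001, 1960), (2, 900002, 1975), (3, 900002, 1988)],
    )
    conn.executemany(
        "INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)",
        [
            (10, 1, 999001, "2020-01-05"),
            (11, 1, 999001, "2021-03-10"),  # same patient, repeat diagnosis
            (12, 2, 999002, "2019-07-21"),
            (13, 3, 999001, "2022-11-02"),
        ],
    )
    return conn


def count_patients_with_condition(conn, concept_id):
    """Count distinct patients with at least one record of the condition."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT person_id) "
        "FROM condition_occurrence WHERE condition_concept_id = ?",
        (concept_id,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    demo = build_demo_db()
    print(count_patients_with_condition(demo, 999001))
```

Because every site that adopts the model uses the same table and column names, a query written once in this style can in principle run unchanged against each site's repository.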

OHDSI access is the newest addition to the DCPHR initiated by the CTSI Informatics Core. OHDSI aims “to improve health by empowering a community to collaboratively generate the evidence that promotes better health decisions and better care.” OHDSI links over 3,000 collaborators across approximately 80 countries with OMOP-based EHR data repositories that collectively contain over 900 million unique patient records (representing about 12% of the world’s population). Each member of OHDSI maintains EHR data for its patient population in an OMOP-based institutional EHR data repository, like the one maintained by DCPHR at Penn State.

Any institution that is a member of the OHDSI consortium can propose a multi-site study with a defined study protocol and solicit participation from collaborators across the OHDSI consortium.

Each site that joins the study executes the study protocol on its own data, and the results of the analyses are pooled to answer the research questions targeted by the study. There is no need for participating institutions to share EHR data with other sites, which significantly lowers the barriers to multi-institutional collaborations.
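The pattern described above, in which each site analyzes its own data and only results are combined, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the OHDSI tooling: the names SiteSummary, run_protocol and pool are hypothetical, the records are made up, and real network studies typically use richer protocols and statistical pooling methods than the simple sum of counts shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate results that a site is willing to share (hypothetical)."""
    site: str
    n_patients: int
    n_with_outcome: int


def run_protocol(site, local_records):
    """Run at one site. Patient-level rows never leave this function;
    only aggregate counts are returned."""
    n_patients = len(local_records)
    n_with_outcome = sum(1 for record in local_records if record["outcome"])
    return SiteSummary(site, n_patients, n_with_outcome)


def pool(summaries):
    """Run by the study coordinator on the shared aggregates only."""
    total = sum(s.n_patients for s in summaries)
    events = sum(s.n_with_outcome for s in summaries)
    prevalence = events / total if total else None
    return {
        "n_patients": total,
        "n_with_outcome": events,
        "prevalence": prevalence,
    }


if __name__ == "__main__":
    # Made-up local data held separately by two sites.
    site_a = [{"outcome": True}] + [{"outcome": False}] * 3
    site_b = [{"outcome": True}] * 2 + [{"outcome": False}] * 4

    shared = [run_protocol("site_a", site_a), run_protocol("site_b", site_b)]
    print(pool(shared))
```

The design choice worth noting is the boundary: the coordinator sees only the SiteSummary objects, which mirrors the article's point that participating institutions do not need to share EHR data with one another.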

“DCPHR allows Penn State researchers to be part of large multi-site studies in ways that were not previously possible,” said Avnish Katoch, research informatics project manager with Penn State CTSI.

Now, Penn State researchers can perform multi-site studies in collaboration with OHDSI. Penn State researchers interested in accessing the OHDSI community or developing proposals for multi-site studies should request a free informatics consultation.

In addition, CTSI’s Informatics Core can assist with study design, including use of AI/ML. The Informatics Core empowers researchers in several ways, including the following:

  • help with study design and feasibility analysis;
  • help with cohort definition and data extraction;
  • support for data preparation;
  • support for analysis of large data sets (characterization, prediction, effect estimation);
  • support for model interpretation, interrogation, deployment, inference; and
  • AI/ML support for research proposal development.

As a proof of concept for working with the OHDSI community, Penn State participated in Project HERA (Health Equity Research Assessment), in which investigators sought to characterize health and healthcare disparities across different groups, outcomes and databases/countries. Investigators used HERA to ask: Are there systematic patterns of diagnosis coding prevalence for Black and white patients across a network of observational health datasets and across all diagnoses? A publication on this study is in preparation.

“During the past year, the Informatics team completed testing the Penn State instance of OMOP-based EHR data repository, set up processes for its periodic refresh, assessed data for quality and completeness, and identified steps to improve data quality. The data repository recently transitioned from the test environment to the production environment, allowing us to open it up for use by the larger biomedical data sciences and clinical research communities at Penn State,” said Honavar. “The next milestone for DCPHR is to support the integration of EHR data with other data sets or individual-level socio-demographic data, de-identification of the integrated data, and provision of AI/ML workflows for analyses of multi-modal health data,” he added.

Other informatics data resources

The Penn State CTSI Informatics Core provides access to electronic health record (EHR) data from TriNetX, drawn from 80+ institutional partners of the TriNetX research network. The TriNetX platform supports basic statistical characterization of EHR data and is best suited for preliminary analyses of large EHR data sets. More extensive analyses, such as those using machine learning, require retrieving the relevant data and running them through AI/ML pipelines (often with assistance from the CTSI Informatics Core’s data science team).

The CTSI Informatics Core

Artificial intelligence and machine learning are necessary for researchers who are interacting with large data sets. However, it can be challenging to understand how to best access and interface with these giant databases. Many research groups at Penn State are working through the CTSI Informatics Core to leverage data science methods to advance their work.

For more information on how the CTSI Informatics Core works, watch this replay of “Harnessing the Power of EHR Data and AI to Advance Biomedical Research,” which covers how and why current research groups have applied artificial intelligence to their research and offers examples of how the computational consulting team can support Penn State researchers’ data science projects.

Last Updated April 15, 2024