Optimizing Electronic Health Records for Large-Scale AI Applications and Analyses

Objective: to optimize EHR data for use in AI applications, ensuring that the data are as accurate, complete, and consistent as possible.

Method: a pipeline to (1) assess EHR data quality in terms of heterogeneity, incompleteness, and inconsistency; (2) improve EHR data quality by addressing the identified issues; (3) visualize data quality metrics; (4) safeguard ethical use and patient privacy.


Overview of the potential of AI in healthcare and medical research

The expanding role of AI in transforming healthcare practices and medical research.

  • Improvements in diagnostics and treatment planning.

  • Enhancements in patient outcomes through predictive analytics.

  • Contributions to personalized medicine and efficient healthcare delivery.

Examples illustrating the significant impacts of AI applications using EHR

Importance of high-quality EHR data

The critical role of accurate, comprehensive, and AI-readable EHR data in developing effective AI tools.

Challenges presented by the inherent complexities of EHR data.

  • Data heterogeneity

    • Variability in data formats and coding standards.

    • The impact of granular differences across healthcare institutions.

  • Data incompleteness and inconsistency

    • The prevalence of missing data and erroneous entries.

    • The lack of standardized protocols for data recording and entry.

  • Need for quality assessment

    • The absence of standardized methodologies for assessing EHR data quality.

    • The challenge in utilizing EHR data confidently for AI applications due to quality concerns.

The contribution of this study as a pipeline for data quality assessment and improvement
  • Data quality assessment module

    • Utilization of code co-occurrence analysis.

    • Assessment of embedding quality.

    • Implementation of additional metrics for a systematic evaluation.

  • Data quality improvement module

    • Techniques for normalization and enhancing data completeness.

    • Adjustments of conditional probabilities to accurately reflect healthcare utilization.

  • Summary of contribution

    • The implications of improved EHR data quality for healthcare AI applications.

    • Enabling more accurate and reliable AI applications.

    • Facilitating innovative medical research and improved healthcare delivery through AI.


Data and use cases

List of healthcare institutions and use cases involved: MGB (RPDR and Biobank), BCH, UPMC (AD and MS), VA, RAprod5 (not used for embedding quality, but may be added as a use case).

Overview of the method

A multifaceted pipeline to assess and improve the quality of EHR data for AI applications, which systematically addresses the complexities and challenges inherent in EHR data.

The modules move from assessing and identifying quality issues, to implementing targeted enhancements, to supporting these efforts with tools for visualization and community engagement, all within a strong ethical framework.

Data quality assessment module

This module identifies specific areas where EHR data quality falls short, providing a targeted approach for subsequent enhancement efforts. It serves as the basis for all quality improvement initiatives.

  • Code co-occurrence analysis: analyze the relationships between different medical codes to assess data consistency and completeness.

  • Embedding quality check: evaluate the semantic quality of data embeddings for accurate data representation.

  • Standardization and ontology mapping: assess the extent of data mapping to common ontologies, identifying gaps and inconsistencies.

  • Missing data identification: identify and document missing data patterns to prioritize areas for improvement.
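As a minimal sketch of the co-occurrence analysis above, the snippet below computes positive pointwise mutual information (PPMI) between diagnosis-code pairs from per-patient code sets; the codes, toy cohort, and choice of PPMI are illustrative assumptions, not the pipeline's actual implementation:

```python
from collections import Counter
from itertools import combinations
import math

def cooccurrence_ppmi(patient_codes):
    """Positive pointwise mutual information between code pairs,
    computed from per-patient sets of recorded codes."""
    n = len(patient_codes)
    code_counts = Counter()
    pair_counts = Counter()
    for codes in patient_codes:
        unique = set(codes)
        code_counts.update(unique)
        pair_counts.update(combinations(sorted(unique), 2))
    ppmi = {}
    for (a, b), c_ab in pair_counts.items():
        val = math.log(c_ab * n / (code_counts[a] * code_counts[b]))
        ppmi[(a, b)] = max(val, 0.0)  # clip negatives: keep only positive association
    return ppmi

# Toy cohort: diabetes (E11.9) and hypertension (I10) co-occur in two patients
patients = [
    {"E11.9", "I10"},
    {"E11.9", "I10"},
    {"N18.3"},
    {"R51"},
]
ppmi = cooccurrence_ppmi(patients)
```

A clinically expected pair whose PPMI is high at one institution but near zero at another can flag inconsistent or incomplete coding practice; here `ppmi[("E11.9", "I10")]` equals log 2.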


Data normalization and standardization module

This module normalizes data and enhances its usability for AI applications by ensuring consistent and accurate data interpretation across diverse datasets and institutional practices.

  • Data harmonization: apply methodologies to normalize data representations and coding practices across EHR systems. (UPMC, VA: use NER to map; describe what we did for KESER, and also cite Doudou’s MIGKI and other pre-trained language models, multi-source fused models)

  • Semi-automated mapping: combine automated candidate suggestions with expert review.
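The semi-automated mapping step can be illustrated with a simple string-similarity baseline; the local lab names, target vocabulary, and LOINC-style codes below are hypothetical, and the actual pipeline relies on NER and embedding-based methods rather than this toy matcher:

```python
import difflib

# Hypothetical target vocabulary of standard names and LOINC-style codes
standard_terms = {
    "Hemoglobin A1c": "4548-4",
    "Creatinine [Mass/volume] in Serum": "2160-0",
    "Cholesterol in LDL": "13457-7",
}

def suggest_mapping(local_term, vocabulary, cutoff=0.4):
    """Return the closest standard term and its code, or None for human review."""
    names = list(vocabulary)
    lowered = [n.lower() for n in names]
    match = difflib.get_close_matches(local_term.lower(), lowered, n=1, cutoff=cutoff)
    if not match:
        return None  # below cutoff: queue for manual curation
    name = names[lowered.index(match[0])]
    return name, vocabulary[name]

suggestion = suggest_mapping("hemoglobin a1c", standard_terms)
```

Matches below the similarity cutoff are routed to expert review, which is what makes the process semi-automated rather than fully automatic.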

Data enhancement module

This module enhances data completeness and accuracy, which directly impacts the performance of AI applications, leading to more reliable and meaningful outcomes.

  • Completeness enhancement: implement strategies to fill in missing data and correct inaccuracies, including unit harmonization. (Doudou’s preliminary results, using RAProd5: use quantiles to check whether distributions match, or whether a constant shift is needed to adjust the unit; cite MICE; check with Molei and Weiying’s paper)

  • Bias correction: adjust data analyses based on healthcare utilization rates and temporal bias to prevent biased conclusions. (Jue Hou, oncology: temporal shift in standard of care over calendar years; cite JAMA Network Open)
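The quantile idea mentioned above for unit harmonization can be sketched as follows; the lab, the sites, the values, and the mg/dL-to-umol/L factor of roughly 88.4 are illustrative assumptions, not results from the study:

```python
import statistics

def quantile(xs, q):
    """Linear-interpolation quantile of a sample."""
    xs = sorted(xs)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def estimate_unit_ratio(site_a, site_b, qs=(0.25, 0.5, 0.75)):
    """Estimate a constant multiplicative shift between two sites'
    values for the same lab by comparing matched quantiles."""
    return statistics.median(quantile(site_a, q) / quantile(site_b, q) for q in qs)

# Hypothetical serum creatinine: site B reports mg/dL, site A umol/L
site_b = [0.7, 0.9, 1.0, 1.2, 1.4]        # mg/dL
site_a = [v * 88.4 for v in site_b]       # same distribution in umol/L
ratio = estimate_unit_ratio(site_a, site_b)
```

A quantile ratio close to a known conversion constant (here ~88.4) suggests a unit mismatch that a constant rescaling can fix; a ratio near 1 with mismatched quantile shapes points to a different problem, such as a distinct assay or population.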

Visualization and tool development module

This module empowers researchers and practitioners to continuously assess and enhance data quality, fostering a culture of quality improvement within the community. (PARSE, ask for Hui)

  • Data quality visualization tools: develop interactive tools that allow users to visualize data quality metrics and issues.

  • Community tools and resources: create and disseminate tools (e.g., the ONCE app) that enable broader community engagement in data quality improvement.
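A visualization tool of the kind described above ultimately renders simple quality metrics; the sketch below shows the per-field missingness computation that could feed such a view (the fields and records are invented, and the plotting layer is omitted):

```python
def missingness_rates(records, fields):
    """Fraction of records with a missing (None) value per field,
    suitable for feeding a bar chart in a quality dashboard."""
    n = len(records)
    return {f: sum(r.get(f) is None for r in records) / n for f in fields}

# Invented encounter records with gaps
records = [
    {"age": 54, "smoking_status": None, "bmi": 27.1},
    {"age": 61, "smoking_status": "never", "bmi": None},
    {"age": None, "smoking_status": "former", "bmi": 31.4},
    {"age": 47, "smoking_status": None, "bmi": 24.9},
]
rates = missingness_rates(records, ["age", "smoking_status", "bmi"])
```

Ranking fields by missingness is one concrete way such a tool can prioritize areas for improvement, echoing the missing-data identification step of the assessment module.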

Ethical considerations and data governance module

This module maintains ethical standards and protects patient privacy, ensuring trust and integrity in AI applications developed from EHR data.

  • Privacy protection measures: implement and enforce strategies to protect patient data privacy.

  • Ethical use guidelines: develop guidelines for the ethical use of EHR data, ensuring compliance with legal and ethical standards.


Data quality assessment outcomes
  • Code co-occurrence results

    • Summary of findings on data consistency and completeness across institutions.

    • Identification of significant inconsistencies or gaps in medical code usage.

  • Embedding quality check findings

    • Assessment outcomes on the semantic quality of data embeddings and their accuracy in representing medical concepts.

  • Standardization and ontology mapping

    • Overview of gaps and inconsistencies in data mapping to common ontologies.

    • Examples of how standardization efforts improved data interoperability.

  • Missing data identification

    • Statistics on missing data patterns discovered across different datasets.

    • Prioritization of areas for improvement based on missing data analysis.

Data normalization and standardization impact
  • Data harmonization results and their accuracy

  • Ontology alignment results and their accuracy

Data enhancement achievements
  • Completeness enhancement

    • Quantitative and qualitative examples of how data completeness was improved
  • Utilization adjustment effectiveness

    • Analysis of how utilization adjustments corrected biases in data analyses, with examples demonstrating the before-and-after differences.
Visualization and tool development
  • Data quality visualization tools

    • Case studies on the usage of visualization tools in identifying and understanding data quality issues.
  • Community tools and resources

    • Examples of community adoption of tools like the ONCE app and the impact on data quality improvement efforts.
Ethical considerations and data governance improvements
  • Privacy protection measures

  • Ethical use guidelines

Overall impact on AI applications and use cases
  • Detailed overview of how the improvements in EHR data quality impacted the specific use cases (MGB, BCH, UPMC, etc.).

  • Comparative analysis or case studies highlighting the before-and-after scenarios for AI applications using the enhanced EHR data.


Discussion

  • Briefly summarize our pipeline and its impact

  • Important findings and insights

  • Limitations and future directions