Training Data for Machine Learning to Enhance Patient-Centered Outcomes Research (PCOR) Data Infrastructure

Creating a Foundation to Advance the use of Artificial Intelligence for Patient Centered Outcomes Research and Clinical Practice
  • Office of the National Coordinator for Health Information Technology (ONC)
  • National Institutes of Health National Library of Medicine (NLM)
Start Date
  • 8/1/2019
  • Use of Clinical Data for Research

STATUS: Completed Project


Artificial intelligence (AI) and associated innovative technologies like machine learning have the power to consume large amounts of data in varied, complex formats to identify effective treatments more quickly, potentially accelerating clinical innovation by speeding up the research lifecycle and application of evidence in clinical settings. Industry experts have acknowledged that large amounts of high-quality training data are a critical part of the foundation that will support researchers’ use of machine learning to accelerate the discovery of novel disease-outcome correlations and associations, and inform the design of prevention and treatment studies.

High-quality training data sets are well-labeled and structured, use common data models and common data elements annotated by domain experts, and combine previously unconnected data resources. They can be used to train algorithms to elucidate knowledge and extract relevant data points for research. This project curated high-quality training data sets on two use cases: (1) kidney disease, which ONC led together with the National Institutes of Health, and (2) drug resistance in patients infected with tuberculosis, which was implemented by NLM with the National Institute of Allergy and Infectious Diseases (NIAID). The lead agencies used these training data sets to develop and train algorithms and to improve their performance. The project also developed and disseminated papers and two final reports that discuss the current strengths and limitations of AI for patient-centered outcomes research (PCOR), industry, and the Department of Health and Human Services (HHS). These materials also include a forward-looking section that provides an initial high-level blueprint identifying the potential for HHS to use AI in discovery and safety surveillance and to address key issues facing the people served by its programs (e.g., Medicare, Medicaid).


This project enhanced the capacity of PCOR researchers to use machine learning by developing and disseminating a number of resources that present not only training data and methods but also lessons learned. Evidence generated from this application of AI can support multiple federal and HHS investments in precision medicine, kidney, and tuberculosis research programs so providers can match patients to the best treatments based on their specific health conditions, life experiences, and genetic/phenotypic profiles.

Project Objectives:

  • Develop high-quality training data sets, capture lessons learned and best practices in data annotation and curation, and compile insights on the data quantity and quality requirements for machine learning as applied in PCOR.
  • Develop machine learning algorithms that will be trained and tested on the curated data sets.
  • Develop implementation guides detailing each method used and the generic aspects of the data that each method leverages, with detail sufficient to facilitate its application to a wider array of use cases.
  • Disseminate tools, training data, and lessons learned to stimulate the application of these methods to a wider array of use cases by PCOR researchers.


  • The project team developed and validated high-quality training data sets that can be used to support patient-centered care and shared decision-making regarding patient preferences and treatment decisions.
  • The project team successfully trained the machine learning models to distinguish between drug-resistant and drug-sensitive tuberculosis in radiographs.
  • The team used chronic kidney disease as a use case to demonstrate how the training data sets can be utilized in AI and machine learning applications for PCOR.
  • The project documented lessons learned, best practices, and recommendations, which can support future applications of machine learning to patient-centered outcomes research.



  • The ONC team produced a final report that summarizes project activities performed by ONC, including the methodology employed to build the training dataset, machine learning models, and use case selection, as well as recommendations to support the ongoing use of the resources. The report is available here:
  • The NLM team produced a final report that describes project activities performed by NLM, discusses key deliverables, and presents lessons learned. The report is available here:
  • The first use case (kidney disease) project website and infographic explain project goals, objectives, and activities.
  • The team produced training data sets and validated machine learning algorithms using data from the United States Renal Data System (USRDS). Researchers can request access to the datasets via the USRDS website.
  • The team produced a training data set and validated machine learning algorithms for the tuberculosis drug resistance use case. These data are available for download on the NIH National Institute of Allergy and Infectious Diseases (NIH/NIAID) TB Portals, available here:
  • The team produced open-source code for the two use cases:
  • The team developed an implementation guide for the first use case (kidney disease), “Training Data for Machine Learning to Enhance PCOR Data Infrastructure,” that provides the methodology for and lessons learned from developing effective, high-quality training datasets and machine learning models. The implementation guide is accompanied by a project overview and a data dictionary.
  • The team developed a second implementation guide for the second use case (tuberculosis), “Training Data for Machine Learning to Enhance Patient-Centered Outcomes Research (PCOR) Data Infrastructure - A Case Study in Tuberculosis Drug Resistance.”


  • The project team published seven manuscripts covering topics relevant to PCOR researchers:
    • “Clinical and radiological features of novel coronavirus pneumonia,” published in The Journal of X-Ray Science and Technology, available here:
    • “Developing and verifying automatic detection of active pulmonary tuberculosis from multi-slice spiral CT images based on deep learning,” published in The Journal of X-Ray Science and Technology, available here:
    • “Differentiating between drug-sensitive and drug-resistant tuberculosis with machine learning for clinical and radiological features,” published in Quantitative Imaging in Medicine and Surgery, available here:
    • “Identifying Drug-Resistant Tuberculosis in Chest Radiographs: Evaluation of CNN Architectures and Training Strategies,” published by the IEEE Engineering in Medicine & Biology Society, available here:
    • “Generalization Challenges in Drug-Resistant Tuberculosis Detection from Chest X-rays,” published in Diagnostics (Basel), available here:
    • “Automated Drug-Resistant TB Screening: Importance of Demographic Features and Radiological Findings in Chest X-Ray,” published by the IEEE Engineering in Medicine & Biology Society, available here:
    • “Combining Radiological and Genomic TB Portals Data for Drug Resistance Analysis,” published in IEEE Access, available here:



Below is a list of ASPE-funded PCORTF projects that are related to this project:

A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research – Synthetic data can be used to initiate, refine, or test innovative research approaches more quickly. This project addressed the need for research-quality synthetic data by increasing the amount and type of realistic, synthetic data that the Synthea™ software program can generate. Synthea™ is an open-source software program that creates high-quality, clinically realistic, synthetic patient health records in large volumes. ONC leveraged its expertise as a coordinator by bringing together a technical expert panel to assist in the development of five use cases for new module development from three categories: opioids, pediatrics, and complex care needs.