A Synthetic Health Data Generation Engine to Accelerate Patient-Centered Outcomes Research

Providing PCOR Researchers with a Low Risk, Readily Available Synthetic Data Source
  • Office of the National Coordinator for Health Information Technology (ONC)
Start Date
  • 4/1/2019
  • Use of Enhanced Publicly-Funded Systems for Research


STATUS: Active Project


High-quality health and health care data are often difficult to access because of cost, patient privacy, or other legal and intellectual property restrictions. To protect patient privacy, researchers and developers often depend on anonymized data to test theories, data models, algorithms, or prototype innovations. However, the risk of re-identifying anonymized data is high and has proven impossible to eliminate completely, especially for patients with rare conditions. Further, because of a variety of interoperability issues, it is often difficult to bring together data from different sources to robustly test analysis models and algorithms or to assist in the development of software applications. Synthetic data can be used to initiate, refine, or test innovative research approaches more quickly. This project proposes to address the need for research-quality synthetic data by increasing the amount and type of realistic synthetic data that the Synthea software program can generate. Synthea is an open source software program that creates high-quality, clinically realistic synthetic patient health records in large volumes.

ONC will leverage its expertise as a coordinator by bringing together a technical expert panel (TEP) to assist in the development of five to seven priority use cases for new module development from three categories: opioids, pediatrics, and complex care needs. To ensure that the modules generate data that fit the needs of patient-centered outcomes researchers, the TEP will include representation from relevant researchers so they can provide input on the design of a given module. Some initial applications of the generated synthetic data include identifying effective prevention methods, treatments, or interventions, such as assessments related to controlled substance prescriptions, that reduce the impact of mental and substance use disorders. Similarly, patient-centered outcomes research (PCOR) researchers may use the synthetic data generated from the other module categories to simulate care interventions, analyze longitudinal patient progress, and potentially incorporate patient-reported outcomes.


A synthetic data engine is a potentially important piece of the greater PCOR data infrastructure because it provides PCOR researchers with a low-risk, readily available synthetic data source. This source complements their use of real clinical data and enhances their ability to conduct rigorous analyses and generate relevant findings that can inform health and treatment decisions.

This project will address the following objectives:

  • Enhance Synthea by developing or updating five to seven data generation modules for opioid, pediatric, and complex care use cases to increase the number and diversity of synthetic patient health records.

  • Administer a prize competition (“challenge”) to encourage researchers and developers to validate the realism of the generated synthetic health records.

  • Support awareness and use of Synthea, including its updated modules, module builder, and the generated synthetic data, through various dissemination mechanisms.