Great data is the foundation of great health data science

Limber up with our curated list of useful datasets for research

Here are some sources of robust, diverse datasets that we think might help with your training. More user reviews are coming soon.

  • AHRQ Social Determinants of Health Database

    The Agency for Healthcare Research and Quality (AHRQ) has assembled data on social determinants of health from multiple federal sources into a single database. Variables cover social context (e.g., age, race/ethnicity), economic context (e.g., income), education, physical infrastructure (e.g., housing), and healthcare context (e.g., health insurance).
    —Access: Open
    —Specialty: Geospatial & Social Determinants of Health
    Data Dictionary
    —Funder: Patient Centered Outcomes Research (PCOR) Trust Fund

  • All of Us

    The ongoing NIH-funded All of Us research initiative aims to build the largest and most diverse repository of health, lifestyle, genetic, and environmental information, covering more than 1,000,000 individuals in the US.
    —Access: Controlled
    —Specialty: Biochemistry & Molecular Genetics
    Publications
    Data Dictionary
    Data Profile/Explorer
    —Funder: US National Institutes of Health (NIH)

  • BioData Catalyst

    BioData Catalyst is a data science platform funded by the National Heart, Lung, and Blood Institute (NHLBI) that provides access to multiple NHLBI datasets and analytic tools. Authorized researchers have access to 3.42 petabytes of data, including the Trans-Omics for Precision Medicine (TOPMed) study and the database of Genotypes and Phenotypes (dbGaP).
    —Access: Controlled
    —Specialty: Biochemistry & Molecular Genetics
    Publications
    Data Dictionary
    —Funder: US National Institutes of Health (NIH)


  • UK Biobank

    The UK Biobank holds genetic and health information on over half a million UK participants, including genetic, biomarker, imaging, EHR, and questionnaire data.
    —Access: Controlled
    —Specialty: Biochemistry & Molecular Genetics
    Publications
    Data Dictionary
    —Funders: Established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government and the Northwest Regional Development Agency

  • The Cancer Imaging Archive

    The Cancer Imaging Archive, hosted by Frederick National Laboratory for Cancer Research, contains collections of de-identified medical images of cancer. Images are available in DICOM format.
    —Access: Mixed
    —Specialty: Pathology
    Publications
    —Data Dictionary: See individual collections
    —Funder: National Cancer Institute

  • EchoNet-Dynamic

    The EchoNet-Dynamic dataset includes over 10,000 de-identified echocardiography videos with labeled measurements, tracings, and calculations.
    —Access: Registration required
    —Specialty: Cardiology
    Publication
    —Funder: Stanford University
    User Review

  • MIMIC-IV

    The Medical Information Mart for Intensive Care (MIMIC) database includes information on over 40,000 patients at the Beth Israel Deaconess Medical Center, from 2008-2019. The data tables include structured EHR fields (diagnoses, procedures, labs) and notes from the critical care unit.
    —Access: Controlled
    —Specialty: Pulmonary and Critical Care
    Publications
    Data Dictionary
    —Funder: US National Institutes of Health (NIH)
    —Host: MIT Laboratory for Computational Physiology

  • National COVID Cohort Collaborative

    The National COVID Cohort Collaborative (N3C) has aggregated EHR data from over 12 million individuals, including more than 4.5 million COVID-positive patients and patients with SARS 1, MERS, and H1N1. Data include diagnoses, procedures, medications, labs, and other structured fields.
    —Access: Controlled
    —Specialty: Infectious Diseases
    Publications
    Data Dictionary
    Data Profile/Explorer
    —Funder: US National Institutes of Health (NIH)

  • Nightingale Open Science

    Nightingale Open Science works with health systems around the world to create and curate datasets of medical images linked to ground-truth labels. Data are de-identified and available for non-profit research on cloud infrastructure.
    —Access: Registration required
    —Specialty: Sudden Cardiac Death, Cancer, Covid-19
    —Funders: Schmidt Futures, The Gordon and Betty Moore Foundation, and Ken Griffin, founder and CEO of Citadel

  • Northwestern Medicine Enterprise Data Warehouse

    The Northwestern Medicine Enterprise Data Warehouse (NMEDW) aggregates data on more than 6.6 million patients seen by the health system. Data sources include electronic health records (EHR), pathology data from hospital and research labs, and research biomarker data.
    —Access: Controlled
    —Specialty: All
    —Funders: Northwestern University Feinberg School of Medicine and Northwestern Memorial Healthcare Corporation

  • The Surveillance, Epidemiology, and End Results (SEER) Program

    The Surveillance, Epidemiology, and End Results (SEER) Program collects and publishes cancer incidence and survival data from population-based cancer registries covering approximately 48.0% of the U.S. population.
    —Access: Controlled
    —Specialty: Cancer incidence, including patient demographics, primary tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status.
    Data Dictionary
    —Funder: National Cancer Institute
    User Review

Access definitions

  1. Controlled access: Application and eligibility requirements need to be met to gain access

  2. Registration required: Open to all, but users need to be signed in or registered with the resource to access

  3. Open access: No access restrictions or registration required to access

  4. Mixed: Has both controlled and open access
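If you keep your own shortlist of these resources, the four access levels above can be encoded so you can filter by how much paperwork a dataset requires before you start. The following is only a minimal sketch: the names `Access`, `CATALOG`, `needs_application`, and `datasets_with` are hypothetical and not part of any of the listed resources, dataset names are shortened in places, and the access levels are copied from the entries on this page, so check each resource's own site before relying on them.

```python
from enum import Enum


class Access(Enum):
    """The four access levels defined on this page."""
    CONTROLLED = "Controlled"
    REGISTRATION_REQUIRED = "Registration required"
    OPEN = "Open"
    MIXED = "Mixed"


# Access level listed on this page for each dataset (illustrative, hand-copied).
CATALOG = {
    "AHRQ Social Determinants of Health Database": Access.OPEN,
    "All of Us": Access.CONTROLLED,
    "BioData Catalyst": Access.CONTROLLED,
    "UK Biobank": Access.CONTROLLED,
    "The Cancer Imaging Archive": Access.MIXED,
    "EchoNet-Dynamic": Access.REGISTRATION_REQUIRED,
    "MIMIC-IV": Access.CONTROLLED,
    "National COVID Cohort Collaborative": Access.CONTROLLED,
    "Nightingale Open Science": Access.REGISTRATION_REQUIRED,
    "Northwestern Medicine Enterprise Data Warehouse": Access.CONTROLLED,
    "SEER Program": Access.CONTROLLED,
}


def needs_application(access: Access) -> bool:
    """True when at least part of the resource sits behind an application."""
    return access in (Access.CONTROLLED, Access.MIXED)


def datasets_with(*levels: Access) -> list[str]:
    """Return dataset names whose access level is one of `levels`, in page order."""
    return [name for name, access in CATALOG.items() if access in levels]


if __name__ == "__main__":
    # Datasets you can start exploring without a formal application.
    for name in datasets_with(Access.OPEN, Access.REGISTRATION_REQUIRED):
        print(name)
```

Treating "Mixed" as needing an application is a conservative choice in this sketch, since some collections in a mixed resource may still be restricted.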

“The EchoNet-Dynamic dataset is very user friendly. The authors have done the pre-processing already, making it a great resource for users eager to try computer vision type tasks immediately. Since the dataset is pre-processed and limited to a single view of the heart, using it for strictly research purposes may be challenging.”

—Baljash Cheema, MD, MSCI, Cardiovascular Disease Fellow
Northwestern University Feinberg School of Medicine

We want to hear from you! Have you used these datasets? What are the benefits and challenges? Let us know.
