Great data is the foundation of great health data science
Limber up with our curated list of useful datasets for research
Here are some sources of robust, diverse datasets we think might help with your training. More user reviews coming soon.
-
The Agency for Healthcare Research and Quality (AHRQ) has assembled data related to social determinants of health from multiple federal sources into one data source. Variables include social context (e.g., age, race/ethnicity), economic context (e.g., income), education, physical infrastructure (e.g., housing), and healthcare context (e.g., health insurance).
—Access: Open
—Specialty: Geospatial & Social Determinants of Health
—Data Dictionary
—Funder: Patient Centered Outcomes Research (PCOR) Trust Fund -
The ongoing NIH-funded All of Us research initiative aims to build the largest and most diverse repository of health, lifestyle, genetic, and environmental information on more than 1,000,000 individuals in the US.
—Access: Controlled
—Specialty: Biochemistry & Molecular Genetics
—Publications
—Data Dictionary
—Data Profile/Explorer
—Funder: US National Institutes of Health (NIH) -
BioData Catalyst is a data science platform funded by the National Heart, Lung, and Blood Institute (NHLBI) that provides access to multiple NHLBI data sets and analytic tools. Authorized researchers have access to 3.42 petabytes of data, including the Trans-Omics for Precision Medicine (TOPMed) study and the database of Genotypes and Phenotypes (dbGaP).
—Access: Controlled
—Specialty: Biochemistry & Molecular Genetics
—Publications
—Data Dictionary
—Funder: US National Institutes of Health (NIH) -
The UK Biobank holds genetic and health information on over half a million UK participants, including genetic, biomarker, imaging, EHR, and questionnaire data.
—Access: Controlled
—Specialty: Biochemistry & Molecular Genetics
—Publications
—Data Dictionary
—Funders: Established by the Wellcome Trust medical charity, Medical Research Council, Department of Health, Scottish Government and the Northwest Regional Development Agency -
The Cancer Imaging Archive, hosted by Frederick National Laboratory for Cancer Research, contains collections of de-identified medical images of cancer. Images are available in DICOM format.
—Access: Mixed
—Specialty: Pathology
—Publications
—Data Dictionary: See individual collections
—Funder: National Cancer Institute -
The EchoNet-Dynamic data set includes over 10,000 de-identified echocardiography videos with labeled measurements, tracings, and calculations.
—Access: Registration required
—Specialty: Cardiology
—Publication
—Funder: Stanford University
—User Review -
The Medical Information Mart for Intensive Care (MIMIC) database includes information on over 40,000 patients at the Beth Israel Deaconess Medical Center, from 2008-2019. The data tables include structured EHR fields (diagnoses, procedures, labs) and notes from the critical care unit.
—Access: Controlled
—Specialty: Pulmonary and Critical Care
—Publications
—Data Dictionary
—Funder: US National Institutes of Health (NIH)
—Host: MIT Laboratory for Computational Physiology -
The National COVID Cohort Collaborative (N3C) has aggregated EHR data from over 12 million individuals, including more than 4.5 million COVID-positive patients and those with SARS-1, MERS, and H1N1. Data include diagnoses, procedures, medications, labs, and other structured fields.
—Access: Controlled
—Specialty: Infectious Diseases
—Publications
—Data Dictionary
—Data Profile/Explorer
—Funder: US National Institutes of Health (NIH) -
Nightingale Open Science works with health systems around the world to create and curate datasets of medical images linked to ground-truth labels. Data are de-identified and available for non-profit research on a cloud infrastructure.
—Access: Registration required
—Specialty: Sudden Cardiac Death, Cancer, COVID-19
—Funders: Schmidt Futures, The Gordon and Betty Moore Foundation, and Ken Griffin, founder and CEO of Citadel -
The Northwestern Medicine Enterprise Data Warehouse (NMEDW) aggregates data on more than 6.6 million patients seen by the health system. Data sources include electronic health records (EHR), pathology data from hospital and research labs, and research biomarker data.
—Access: Controlled
—Specialty: All
—Funders: Northwestern University Feinberg School of Medicine and Northwestern Memorial Healthcare Corporation -
The Surveillance, Epidemiology, and End Results (SEER) Program collects and publishes cancer incidence and survival data from population-based cancer registries covering ~48.0% of the U.S. population.
—Access: Controlled
—Specialty: Cancer incidence, including patient demographics, primary tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status.
—Data Dictionary
—Funder: National Cancer Institute
—User Review
Access definitions
Controlled access: Application and eligibility requirements need to be met to gain access
Registration required: Open to all, but users need to be signed in or registered with the resource to access
Open access: No access restrictions or registration required to access
Mixed: Has both controlled and open access
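The four access levels above can be treated as a small taxonomy when organizing a personal shortlist of datasets. The following is a minimal Python sketch, not part of the original list: the `Access`, `Dataset`, and `CATALOG` names and both helper functions are hypothetical, and the catalog holds only a few entries copied from the list above for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """Access levels as defined in the list above."""
    OPEN = "Open"
    REGISTRATION_REQUIRED = "Registration required"
    CONTROLLED = "Controlled"
    MIXED = "Mixed"


@dataclass(frozen=True)
class Dataset:
    """Hypothetical record for one catalog entry."""
    name: str
    access: Access
    specialty: str


# A few entries from the list above; names are shortened for illustration.
CATALOG = [
    Dataset("AHRQ social determinants of health data", Access.OPEN,
            "Geospatial & Social Determinants of Health"),
    Dataset("All of Us", Access.CONTROLLED, "Biochemistry & Molecular Genetics"),
    Dataset("The Cancer Imaging Archive", Access.MIXED, "Pathology"),
    Dataset("EchoNet-Dynamic", Access.REGISTRATION_REQUIRED, "Cardiology"),
    Dataset("MIMIC", Access.CONTROLLED, "Pulmonary and Critical Care"),
]


def needs_application(dataset):
    """True when at least part of the resource requires an application."""
    return dataset.access in (Access.CONTROLLED, Access.MIXED)


def fully_available_without_application(catalog):
    """Entries that are entirely open or need only a sign-up."""
    allowed = (Access.OPEN, Access.REGISTRATION_REQUIRED)
    return [d for d in catalog if d.access in allowed]


if __name__ == "__main__":
    for d in fully_available_without_application(CATALOG):
        print(f"{d.name}: {d.access.value}")
```

Treating Mixed as needing an application is a conservative choice in this sketch, since such resources have both open and controlled parts; check the individual collections for details.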
“The EchoNet-Dynamic dataset is very user friendly. The authors have done the pre-processing already, making it a great resource for users eager to try computer vision type tasks immediately. Since the dataset is pre-processed and limited to a single view of the heart, using it for strictly research purposes may be challenging.”
—Baljash Cheema, MD, MSCI, Cardiovascular Disease Fellow
Northwestern University Feinberg School of Medicine
We want to hear from you! Have you used these datasets? What are the benefits and challenges? Let us know.