Working with open-source Human Microbiome Project Data phases 1 (HMP) and 2 (iHMP): Efficient Data Access and Analysis Workflow
Syllabus
Instructors names and contact information
Levi Waldron, Graduate School of Public Health and Health Policy, City University of New York, New York, NY. email: Levi.Waldron@sph.cuny.edu
Ni Zhao, Department of Biostatistics, Johns Hopkins University, Baltimore, MD. email: nzhao10@jhu.edu
Ekaterina Smirnova, Department of Biostatistics, Virginia Commonwealth University, Richmond, VA. email: ekaterina.smirnova@vcuhealth.org
Workshop Description
The composition of microbial species in a human body is essential for maintaining human health, and it is associated with a number of diseases including obesity, bowel inflammatory disease, and bacterial vaginosis. Over the last decade, microbiome data analysis almost entirely shifted towards using samples taken directly from various sites of human body and to explore a large number of microbes using 16S or whole metagenome sequencing. This technological shift has led to a radical change in the data collected and led to the creation of the Human Microbiome Project (HMP) in 2008. With the growth and success of the microbiomics field, the size and complexity, and availability of the microbiome data in any given experiment have increased exponentially. Publicly-available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis.
The main goal of this workshop is to provide a much-needed tutorial on downloading, understanding, and analyzing the publicly-available HMP data. We will describe the two packages, HMP16SData and HMP2Data, that provide the comprehensive tutorial on the analysis of Phase 1 and 2 of the HMP project, respectively. These packages provide count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. As such, these packages provide the first comprehensive publicly-available tool for working with microbiome data collected at different body sites (e.g., fecal, nasal, vaginal) coupled with other omics approaches (e.g., cytokines, whole metagenome), and is conducted longitudinally on multiple subjects at multiple visits. By removing bioinformatic hurdles of data access and management, HMP16SData and HMP2Data enable researchers with only basic R skills to quickly analyze HMP data. This allows studying the dynamics of microbial composition over the disease progression, discovering disease-specific microbial biomarkers, and understanding the role of microbiome in connection to other omics measurements.
This workshop will consist of the series on the instructors-led live demos presented together with the brief lecture materials and statistical methods review when necessary. Completely worked out examples are available through the HMP16SData and HMP2Data package vignettes.
Pre-requisites
- Basic knowledge of R syntax
- Familiarity with the microbiome data
- Familiarity with alpha and beta diversity in the context of microbiome data analysis
- Familiarity with principal components and principal coordinates analysis
Background reading:
- Schiffer L, Azhar R, Shepherd L, Ramos M, Geistlinger L, Huttenhower C, Dowd JB, Segata N, Waldron L (2019). “HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor.” American Journal of Epidemiology. doi: 10.1093/aje/kwz006.
Time outline
Activity | Time |
---|---|
HMP16SData | 10m |
HMP2Data | 10m |
Microbiome Analyses | 15m |
Multi-omics data integration | 15m |
Workshop goals and objectives
The main purpose of this workshop is to help researchers interested in public microbiome data to start using the HMP16SData and HMP2Data packages.
Learning goals
- understand the goals of HMP projects and the studies available through the HMP DAC portal
- download and use HMP data packages
- understand the difference between phase 1 and phase 2 projects
- understand principles of -omics data integration and visualization
Learning objectives
- install HMP data packages and access vignettes
- download HMP data to produce analyses illustrated in package vignettes
- create sample summary statistics
- create alpha and beta diversity plots
- understand longitudinal patterns and examine microbiome diversity changes
- integrate 16S and cytokines data using co-inertia analyses techniques
- learn capturing patterns across -omics data sets of different structure
Setup
Installing all packages necessary for this workshop can be accomplished by the following command:
BiocManager::install("waldronlab/MicrobiomeWorkshop", build_vignettes=TRUE,
dependencies=TRUE)
Load the workshop package:
library(MicrobiomeWorkshop)
Then you should be able to view the compiled vignette:
vignette("MicrobiomeWorkshop")
Note, all packages are listed in the DESCRIPTION file of this workshop.
Also some utility packages:
suppressPackageStartupMessages({
library(kableExtra)
library(magrittr)
library(reshape2)
library(gridExtra)
})
Source some miscellaneous R scripts for HMP2Data component:
source(system.file(package='MicrobiomeWorkshop', 'vignettes', "CIAPlots.R"))
source(system.file(package='MicrobiomeWorkshop', 'vignettes', "ScreePlot.R"))
Introduction
Bioconductor provides curated resources of microbiome data. Most microbiome data are generated either by targeted amplicon sequencing (usually of variable regions of the 16S ribosomal RNA gene) or by metagenomic shotgun sequencing (MGX). These two approaches are analyzed by different sequence analysis tools, but downstream statistical and ecological analysis can involve any of the following types of data:
- taxonomic abundance at different levels of the taxonomic hierarchy
- phylogenetic distances and the phylogenetic tree of life
- metabolic potential of the microbiome
- abundance of microbial genes and gene families
A review of types and properties of microbiome data is provided by (Morgan and Huttenhower 2012).
curatedMetagenomicData: Curated and processed metagenomic data through ExperimentHub
curatedMetagenomicData
(Pasolli et al. 2017) provides 6 types of processed data for >30 publicly available whole-metagenome shotgun sequencing datasets, including from the Human Microbiome Project Phase 1:
- Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
- Presence of unique, clade-specific markers
- Abundance of unique, clade-specific markers
- Abundance of gene families
- Metabolic pathway coverage
- Metabolic pathway abundance
Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.
Currently, curatedMetagenomicData
provides:
- 8184 samples from 46 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
- Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor ExpressionSet objects
- ~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
- Processing of data through the MetaPhlAn2 pipeline for taxonomic abundance, and HUMAnN2 pipeline for metabolic analysis
- These represent ~100TB of raw sequencing data, but the processed data provided are much smaller.
These datasets are documented in the reference manual.
This is an ExperimentHub
package, and its main workhorse function is curatedMetagenomicData()
:
The manually curated metadata for all available samples are provided in a single table combined_metadata
:
library(curatedMetagenomicData)
?combined_metadata
View(data.frame(combined_metadata))
The main function provides a list
of ExpressionSet
objects:
oral <- c("HMP_2012.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
#> Working on HMP_2012.metaphlan_bugs_list.oralcavity
#> snapshotDate(): 2019-06-20
#> see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> 'EH2491 : 2491'
esl
#> List of length 1
#> names(1): HMP_2012.metaphlan_bugs_list.oralcavity
These ExpressionSet
objects can also be converted to phyloseq
object for ecological analysis and differential abundance analysis using the DESeq2
package, using the ExpressionSet2phyloseq()
function:
ExpressionSet2phyloseq( esl[[1]], phylogenetictree = TRUE)
#> Loading required namespace: phyloseq
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 600 taxa and 415 samples ]
#> sample_data() Sample Data: [ 415 samples by 18 sample variables ]
#> tax_table() Taxonomy Table: [ 600 taxa by 8 taxonomic ranks ]
#> phy_tree() Phylogenetic Tree: [ 600 tips and 599 internal nodes ]
See the documentation of phyloseq
for more on ecological and differential abundance analysis of the microbiome.
HMP16SData: 16S rRNA Sequencing Data from the Human Microbiome Project
suppressPackageStartupMessages(library(HMP16SData))
#> snapshotDate(): 2019-06-20
HMP16SData
(Schiffer et al. 2019) is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data. Taxonomic count data files are provided as downloaded from the HMP Data Analysis and Coordination Center from its QIIME pipeline. Processed data is provided as SummarizedExperiment
class objects via ExperimentHub
. Like other ExperimentHub-based packages, a convenience function does downloading, automatic local caching, and serializing of a Bioconductor data class. This returns taxonomic counts from the V1-3 variable region of the 16S rRNA gene, along with the unrestricted participant data and phylogenetic tree.
V13()
#> snapshotDate(): 2019-06-20
#> see ?HMP16SData and browseVignettes('HMP16SData') for documentation
#> downloading 0 resources
#> loading from cache
#> 'EH1117 : 1117'
#> class: SummarizedExperiment
#> dim: 43140 2898
#> metadata(2): experimentData phylogeneticTree
#> assays(1): 16SrRNA
#> rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
#> rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
#> colnames(2898): 700013549 700014386 ... 700114963 700114965
#> colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID
This can also be converted to phyloseq
for ecological and differential abundance analysis; see the HMP16SData
vignette for details.
HMP2Data
Integrative Human Microbiome Project (iHMP) https://hmpdacc.org/ihmp/
One of the largest open data resources for studying longitudinal microbiome changes and connection to disease
Novel data – first scientific results just published in the special collection of Nature and its sister journals (https://www.nature.com/articles/s41586-019-1238-8)
- Human Microbiome Project Data Portal (https://portal.hmpdacc.org/)
- Open data – but not straightforward to download and work with
- Need Aspera client to even download the data – not every lab/researcher has these expertise available
- Still need multiple processing steps after data is downloaded to:
map taxonomy ids from the database
merge with meta data – not available on the DAC portal
construct phylogenetic tree
merge omics data.
HMP2Data package
The following packages will be used in this vignette to provide demonstrative examples of what a user might do with HMP2Data.
suppressPackageStartupMessages({
library(phyloseq)
library(SummarizedExperiment)
library(MultiAssayExperiment)
library(dplyr)
library(ggplot2)
library(UpSetR)
library(ade4)
library(vegan)
library(MiRKAT)
})
HMP2Data
package is currently under review as a Bioconductor package. Current (development) version can be installed can be installed from John Stansfield’s GitHub repository https://github.com/jstansfield0/HMP2Data as follows. This is the version used for the workshop.
BiocManager::install("jstansfield0/HMP2Data")
The stable version is available at the dozmorovlab GitHub repository https://github.com/dozmorovlab/HMP2Data
BiocManager::install("dozmorovlab/HMP2Data")
Load HMP2Data
package
library(HMP2Data)
Once installed, HMP2Data
provides functions to access the data from the three major HMP2 studies.
Multi-omics microbiome study pregnancy initiavite (MOMS-PI)
- Vaginal microbiome consortium study at Virginia Commonwealth University (VCU) http://vmc.vcu.edu/momspi
- Pre-term birth study results: https://www.nature.com/articles/s41591-019-0450-2
- Trem-birth study: https://www.nature.com/articles/s41591-019-0465-8
- Only a subset of samples available on the DAC portal was used in current publications
- Novel longitudinal and multi-omics models have not yet been throughly explored
- Data structure gives many opportunities for statistical models development and data exploration
16S data
Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:
data("momspi16S_mtx")
Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:
data("momspi16S_tax")
# Check if Greengene IDs match between the 16S and taxonomy data
# all.equal(rownames(momspi16S_mtx), rownames(momspi16S_tax)) # Should be TRUE
Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:
data("momspi16S_samp")
# Check if sample names match between the 16S and sample data
# all.equal(colnames(momspi16S_mtx), rownames(momspi16S_samp)) # Should be TRUE
The momspi16S
function assembles those matrices into a phyloseq
object.
momspi16S_phyloseq <- momspi16S()
momspi16S_phyloseq
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 7665 taxa and 9107 samples ]
#> sample_data() Sample Data: [ 9107 samples by 13 sample variables ]
#> tax_table() Taxonomy Table: [ 7665 taxa by 7 taxonomic ranks ]
Cytokine data
The MOMS-PI cytokine data can be loaded as a matrix, rownames are cytokine names, colnames are sample names:
data("momspiCyto_mtx")
momspiCyto_mtx[1:5, 1:3]
#> EP004835_K10_MVAX EP004835_K20_MVAX EP004835_K30_MVAX
#> Eotaxin 68.05 60.08 37.84
#> FGF 265.45 -2.00 -2.00
#> G-CSF 624.79 2470.70 -2.00
#> GM-CSF 51.39 -2.00 42.01
#> IFN-g 164.58 193.22 184.53
The cytokine data has 29 rows and 1396 columns.
Load the cytokine sample annotation data as a matrix, rows are samples, columns are annotations:
data("momspiCyto_samp")
colnames(momspiCyto_samp)
#> [1] "file_id" "md5" "size"
#> [4] "urls" "sample_id" "file_name"
#> [7] "subject_id" "sample_body_site" "visit_number"
#> [10] "subject_gender" "subject_race" "study_full_name"
#> [13] "project_name"
The function momspiCytokines
will make a SummarizedExperiment
containing cytokine data
momspiCyto <- momspiCytokines()
momspiCyto
#> class: SummarizedExperiment
#> dim: 29 1396
#> metadata(0):
#> assays(1): ''
#> rownames(29): Eotaxin FGF ... FGF basic IL-17
#> rowData names(1): cytokine
#> colnames(1396): EP004835_K10_MVAX EP004835_K20_MVAX ...
#> EP996091_K40_MVAX EP996091_K60_MVAX
#> colData names(13): file_id md5 ... study_full_name project_name
IBD
Load 16S data as a matrix, rows are SILVA IDs, columns are sample names:
data("IBD16S_mtx")
Load the SILVA taxonomy table as a matrix, rows are SILVA IDs, columns are taxonomic ranks:
data("IBD16S_tax")
colnames(IBD16S_tax)
#> [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus"
Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:
data("IBD16S_samp")
colnames(IBD16S_samp) %>% head()
#> [1] "Project" "sample_id" "subject_id" "site_sub_coll"
#> [5] "data_type" "week_num"
The IBD phyloseq
object can be loaded as follows.
IBD <- IBD16S()
IBD
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 982 taxa and 178 samples ]
#> sample_data() Sample Data: [ 178 samples by 490 sample variables ]
#> tax_table() Taxonomy Table: [ 982 taxa by 6 taxonomic ranks ]
T2D
Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:
data("T2D16S_mtx")
Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:
data("T2D16S_tax")
colnames(T2D16S_tax)
#> [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:
data("T2D16S_samp")
colnames(T2D16S_samp)
#> [1] "file_id" "md5" "size"
#> [4] "urls" "sample_id" "file_name"
#> [7] "subject_id" "sample_body_site" "visit_number"
#> [10] "subject_gender" "subject_race" "study_full_name"
#> [13] "project_name"
The T2D phyloseq
object can be loaded like so.
T2D <- T2D16S()
T2D
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 12062 taxa and 2208 samples ]
#> sample_data() Sample Data: [ 2208 samples by 13 sample variables ]
#> tax_table() Taxonomy Table: [ 12062 taxa by 7 taxonomic ranks ]
Features
Frequency Table Generation
The sample-level annotation data contianed in HMP2Data
can be summarized using the table_two
function. First, we need to make a list of the phyloseq
and SummarizedExperiment
objects which can then be entered into the table_two()
table generating function.
list("MOMS-PI 16S" = momspi16S_phyloseq, "MOMS-PI Cytokines" = momspiCyto, "IBD 16S" = IBD, "T2D 16S" = T2D) %>% table_two()
N | % | N | % | N | % | N | % | |
---|---|---|---|---|---|---|---|---|
Body Site | ||||||||
buccal mucosa | 3313 | 36.38 | 311 | 22.28 | 0 | 0 | 0 | 0 |
cervix of uterus | 162 | 1.78 | 0 | 0 | 0 | 0 | 0 | 0 |
feces | 765 | 8.4 | 0 | 0 | 178 | 100 | 1041 | 47.15 |
rectum | 2679 | 29.42 | 0 | 0 | 0 | 0 | 0 | 0 |
unknown | 146 | 1.6 | 0 | 0 | 0 | 0 | 0 | 0 |
vagina | 2042 | 22.42 | 979 | 70.13 | 0 | 0 | 0 | 0 |
blood cell | 0 | 0 | 106 | 7.59 | 0 | 0 | 0 | 0 |
nasal cavity | 0 | 0 | 0 | 0 | 0 | 0 | 1167 | 52.85 |
Sex | ||||||||
male | 0 | 0 | 0 | 0 | 84 | 47.19 | 1248 | 56.52 |
female | 9107 | 100 | 1396 | 100 | 94 | 52.81 | 947 | 42.89 |
unknown | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 0.59 |
Race | ||||||||
african american | 0 | 0 | 0 | 0 | 5 | 2.81 | 117 | 5.3 |
asian | 0 | 0 | 0 | 0 | 0 | 0 | 235 | 10.64 |
caucasian | 0 | 0 | 0 | 0 | 164 | 92.13 | 1657 | 75.05 |
ethnic other | 0 | 0 | 0 | 0 | 8 | 4.49 | 73 | 3.31 |
hispanic or latino | 0 | 0 | 0 | 0 | 0 | 0 | 126 | 5.71 |
american indian or alaska native | 0 | 0 | 0 | 0 | 1 | 0.56 | 0 | 0 |
unknown | 9107 | 100 | 1396 | 100 | 0 | 0 | 0 | 0 |
total samples | 9107 | 100 | 1396 | 100 | 178 | 100 | 2208 | 100 |
Visits Histograms
We can plot the histogram of the number of samples at each visit for all studies. Note that the X-axis is capped at count of 40 for better visualization.
# make data.frame for plotting
plot_visits <- data.frame(study = c(rep("MOMS-PI Cytokines", nrow(momspiCyto_samp)),
rep("IBD 16S", nrow(IBD16S_samp)),
rep("T2D 16S", nrow(T2D16S_samp))),
visits = c(momspiCyto_samp$visit_number,
IBD16S_samp$visit_number,
T2D16S_samp$visit_number))
p2 <- ggplot(plot_visits, aes(x = visits, fill = study)) +
geom_histogram(position = "dodge", alpha = 0.7, bins = 30, color = "#00BFC4") + xlim(c(0, 40)) +
theme(legend.position = c(0.8, 0.8)) +
xlab("Visit number") + ylab("Count")
p2
#> Warning: Removed 677 rows containing non-finite values (stat_bin).
#> Warning: Removed 4 rows containing missing values (geom_bar).
Note that there are 677 samples at visit numbers over 40.
UpsetR plots
MOMS-PI 16S rRNA data
Here we plot the body sites which samples were taken from for each patient in the MOMS-PI 16S data.
# set up data.frame for UpSetR
momspi_upset <- aggregate(momspi16S_samp$sample_body_site, by = list(momspi16S_samp$subject_id), table)
tmp <- as.matrix(momspi_upset[, -1])
tmp <- (tmp > 0) *1
momspi_upset <- data.frame(patient = momspi_upset$Group.1, tmp)
# plot
upset(momspi_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)
MOMS-PI Cytokines data
Here we plot the body sites which samples were taken from for each patient in the MOMS-PI cytokines data.
# set up data.frame for UpSetR
momspiCyto_upset <- aggregate(momspiCyto_samp$sample_body_site, by = list(momspiCyto_samp$subject_id), table)
tmp <- as.matrix(momspiCyto_upset[, -1])
tmp <- (tmp > 0) *1
momspiCyto_upset <- data.frame(patient = momspiCyto_upset$Group.1, tmp)
# plot
upset(momspiCyto_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)
IBD data
The IBD patients only had samples taken from the feces.
T2D data
Here we plot the body sites which samples were taken from for each patient in the T2D 16S rRNA data.
# set up data.frame for UpSetR
T2D_upset <- aggregate(T2D16S_samp$sample_body_site, by = list(T2D16S_samp$subject_id), table)
tmp <- as.matrix(T2D_upset[, -1])
tmp <- (tmp > 0) *1
T2D_upset <- data.frame(patient = T2D_upset$Group.1, tmp)
# plot
upset(T2D_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)
Microbiome Analyses: IBD example
Table1Var <- c("subject_gender", "Age.at.diagnosis", "race", "visit_number",
"site_name", "Education.Level", "Antibiotics", "BMI")
demo_data = sample_data(IBD)[ ,c("sample_id", "subject_id", Table1Var, "diagnosis")] %>%
data.frame()
#all <- CreateTableOne(vars = Table1Var, data = demo_data)
#kableone(all)
stratified = tableone::CreateTableOne(
vars = Table1Var,
data = summarytools::unlabel(demo_data), strata = "diagnosis", includeNA = TRUE)
#> Registered S3 method overwritten by 'pryr':
#> method from
#> print.bytes Rcpp
#> Warning in ModuleReturnVarsExist(vars, data): These variables only have NA/
#> NaN: BMI Dropped
#> Warning in min(x, na.rm = TRUE): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf
#> Warning in StdDiff(variable = var, group = strataVar): Variable has only
#> NA's in at least one stratum. na.rm turned off.
stratified <- print(stratified, printToggle = FALSE, showAllLevels = FALSE)
stratified[,!(colnames(stratified) %in% "test")]%>%
knitr::kable(format = "html", caption = "Characteristics of the Study Cohort",
col.names = c("CD", "nonIBD",
"UC", "p-value"))%>%
kable_styling("striped", full_width = T)
CD | nonIBD | UC | p-value | |
---|---|---|---|---|
n | 86 | 46 | 46 | |
subject_gender = male (%) | 43 ( 50.0) | 24 ( 52.2) | 17 ( 37.0) | 0.264 |
Age.at.diagnosis (mean (SD)) | 21.34 (11.31) | NaN (NA) | 25.65 (15.21) | 0.067 |
race (%) | 0.002 | |||
American Indian or Alaska Native | 1 ( 1.2) | 0 ( 0.0) | 0 ( 0.0) | |
Black or African American | 0 ( 0.0) | 0 ( 0.0) | 5 ( 10.9) | |
More than one race | 0 ( 0.0) | 2 ( 4.3) | 2 ( 4.3) | |
Other | 4 ( 4.7) | 0 ( 0.0) | 0 ( 0.0) | |
White | 81 ( 94.2) | 44 ( 95.7) | 39 ( 84.8) | |
visit_number (mean (SD)) | 5.33 (10.80) | 2.35 (6.39) | 1.67 (4.57) | 0.034 |
site_name (%) | <0.001 | |||
Cedars-Sinai | 21 ( 24.4) | 2 ( 4.3) | 23 ( 50.0) | |
Cincinnati | 28 ( 32.6) | 20 ( 43.5) | 11 ( 23.9) | |
MGH | 24 ( 27.9) | 18 ( 39.1) | 6 ( 13.0) | |
MGH Pediatrics | 13 ( 15.1) | 6 ( 13.0) | 6 ( 13.0) | |
Education.Level (%) | 0.001 | |||
7th grade or less | 14 ( 16.3) | 18 ( 39.1) | 7 ( 15.2) | |
Bachelor’s degree | 9 ( 10.5) | 10 ( 21.7) | 4 ( 8.7) | |
High school graduate or GED | 5 ( 5.8) | 2 ( 4.3) | 0 ( 0.0) | |
Master’s degree | 0 ( 0.0) | 2 ( 4.3) | 4 ( 8.7) | |
Professional/Doctoral degree | 6 ( 7.0) | 2 ( 4.3) | 4 ( 8.7) | |
Some college, no degree | 12 ( 14.0) | 0 ( 0.0) | 7 ( 15.2) | |
Some high school | 27 ( 31.4) | 8 ( 17.4) | 10 ( 21.7) | |
Unknown/Not Reported | 11 ( 12.8) | 2 ( 4.3) | 10 ( 21.7) | |
NA | 2 ( 2.3) | 2 ( 4.3) | 0 ( 0.0) | |
Antibiotics = No (%) | 86 (100.0) | 46 (100.0) | 46 (100.0) | NA |
###Alpha diversity analysis
Alpha Diversity Definitions
Alpha diversity is a within-sample diversity measurement that models the richness and/or eveness of the microbial community. Different alpha diversities emphasize different aspect of the community. Commonly used alpha diversities include:
Diversity | Description | Formula |
---|---|---|
Observed Species | No. of unique taxa in a sample | Number of species |
Chao 1 | Adding a correction to observed species | $S_{ob} + \frac{n_1^2}{2n_2}$ |
Shannon | Both richness and evenness | $-\sum_{i=1}^{s} p_i \log(p_i)$ |
Simpson reciprocal | Both richness and evenness | $\frac{N(N - 1)}{n(n-1)}$ |
Pielou’s evenness | evenness | $\frac{H}{H_{max}}$ |
Of course, the alpha diversity measurement is impacted by the sequencing depth (total number of reads per sample). Rarefying (through vegan:rarefy) is necessary before calculating alpha diversity.
Comparing Alpha Diversities between Groups
We can use “plot_richness” from phyloseq package to plot the alpha diversities.
IBD %<>%
taxa_sums() %>%
is_greater_than(0) %>%
prune_taxa(IBD)
sample_data(IBD)$diagnosis = factor(sample_data(IBD)$diagnosis)
p = plot_richness(IBD, x="diagnosis", color="diagnosis", measures=c("Observed", "InvSimpson", "Shannon", "Chao1")) + geom_jitter()
p + geom_boxplot(data = p$data, aes(x = diagnosis, y = value, color = NULL), alpha = 0.1)
#> Warning: Removed 534 rows containing missing values (geom_errorbar).
Once we summarize the whole community through a one-dimensional summary statistics (alpha diversity), all the traditional methods for hypothesis testing are applicable.
alpha = estimate_richness(IBD, measures=c("Observed", "InvSimpson", "Shannon"))
alpha_df = data.frame(sample_data(IBD)[,c("sample_id","subject_id")], alpha)
rownames(alpha_df) = NULL
kable(head(alpha_df),
digits = 3,caption = "Some Alpha Diversities of the IBD dataset") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F,position = "center")
sample_id | subject_id | Observed | Shannon | InvSimpson |
---|---|---|---|---|
206534 | M2008 | 112 | 2.572 | 5.590 |
206536 | M2008 | 96 | 2.534 | 5.565 |
206538 | M2008 | 80 | 2.655 | 6.452 |
206547 | M2014 | 77 | 2.733 | 7.030 |
206548 | M2014 | 69 | 2.632 | 6.556 |
206561 | M2021 | 63 | 3.357 | 16.324 |
Beta diversity analysis
Beta Diversity Definitions
Beta diversity describes how samples vary against each other. Beta diversity is typically the thinking behind “clustering” algorithms that show differences or similarities among samples. For example, we may be intrested in the differences in gut microbiome between non IBD and IBD patients.
List of the commonly used beta-diversities:
Diversity | Description | Formula |
---|---|---|
UniFrac | Qualitative, Phylogenetics-based | $\sum_{i=1}^n\frac{b_i|I(p_i^A > 0)-I(p_i^B > 0)|}{\sum_{i=1}^n b_i}$ |
Weighted UniFrac | Quantitative,Phylogenetics-based | $\frac{\sum_{i=1}^n b_i|p_i^A - p_i^B|}{\sum_{i=1}^nb_i(p_i^A + p_i^B)}$ |
Generalized UniFrac | Compromise between the previous two | $\frac{\sum_{i=1}^n b_i (p_i^A + p_i^B)^{\alpha} |\frac{p_i^A - p_i^B}{p_i^A + p_i^B}|} {\sum_{i=1}^nb_i(p_i^A + p_i^B)^{\alpha}}$ |
Bray-Curtis | Quantitative | $1 - \frac{2C_{i,j}}{S_i + S_j}$ |
Jaccard | Qualitative | $\frac{|A \cap B|}{|A \cup B|}$ |
Comparing Beta Diversities between Groups
par(mfrow = c(1,2))
jac <- as.matrix(distance(t(otu_table(IBD)), method = "jaccard"))
mod1 <- vegan::betadisper(as.dist(jac), as.factor(IBD16S_samp$diagnosis))
plot(mod1, ellipse = TRUE, hull = FALSE, main="Jaccard Distance", xlab="PC 1", ylab="PC 2", sub=NULL)
box()
bc = as.matrix(ecodist::bcdist(t(otu_table(IBD))))
mod2 <- vegan::betadisper(as.dist(bc), as.factor(IBD16S_samp$diagnosis))
plot(mod2, ellipse = TRUE, hull = FALSE, main="Bray-Curtis Dissimilarity", xlab="PC 1", ylab="PC 2", sub=NULL, xlim = c(-0.2, 0.2))
legend("bottomright", legend=c("CD", "nonIBD", "UC"), pch = 1:3, col=c("Black", "red", "Green"), cex=1.3, box.lty=0)
box()
#####Permutational Multivariate Analysis of Variance
Permutational multivariate analysis of variance (PERMANOVA) (???), a non-parametric multivariate statistical test, is one commonly used for assessing the association between microbiome community and a phenotype (e.g., IBD or not).
However, the PERMANOVA approach, by itself, doesn’t allow for repeated measures. Ignoring the repeated measures (treating them as if they are all independent samples) can lead to inflated type I error (i.e., more false positive discoveries than we would like to accept through our p-value cutoff). However, we can use the method if we consider a subset of the samples which are independent with each other, such as those from the baseline.
PERMANOVA requires the selection of a single distance metric. Howeve, if the selected distance metric doesn’t capture the underlying association pattern, PERMANOVA tends to lose power.
IBD0 = subset_samples(IBD, week_num == 0&diagnosis %in% c("nonIBD", "CD"))
IBD0 %<>%
taxa_sums() %>%
is_greater_than(0) %>%
prune_taxa(IBD0)
jac0 <- as.matrix(distance(t(otu_table(IBD0)), method = "jaccard"))
bc0 = as.matrix(ecodist::bcdist(t(otu_table(IBD0))))
CD = as.numeric(sample_data(IBD0)$diagnosis == "CD")
gender = as.numeric(factor(sample_data(IBD0)$subject_gender))
fit = vegan::adonis(bc0 ~ gender + CD, permutations=500)
kable(fit$aov.tab,
digits = 3,caption = "PERMANOVA result using Bray-Curtis distance") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
---|---|---|---|---|---|---|
gender | 1 | 0.421 | 0.421 | 1.288 | 0.017 | 0.178 |
CD | 1 | 0.740 | 0.740 | 2.260 | 0.030 | 0.010 |
Residuals | 71 | 23.237 | 0.327 | NA | 0.952 | NA |
Total | 73 | 24.398 | NA | NA | 1.000 | NA |
Microbiome Regression Kernel Association Test (MiRKAT)
MiRKAT: a regression approach for flexibly associating the microbial community with phenotype of interest (???). The method also has extensions that can handle longitudinal measures (???), multivariate (???) and survial outcomes (???).
g(yi) = Xiβ + f(Mi) + ϵi, f∼ a reproducing kernel Hilbert space generated by function K. Equivalently, f ∼ N(0, τK), in which K is a similarity matrix that can be calculated through a distance metric.
Multiple kernels (distances) are allowed simulatenously in this framework, making this procedure robust to the underlying association patterns.
K_bc = MiRKAT::D2K(bc0); K_jac = D2K(jac0)
y = ifelse(sample_data(IBD0)$diagnosis == "CD",1, 0 )
out = MiRKAT::MiRKAT(y=y, X = gender, Ks = list(K_bc, K_jac), out_type = "D")
Multi-omics data integration: MOMS-PI example
Combine 16S and cytokines data
#order both sets by visit number within a subject
momspi16S_samp <- momspi16S_samp[
with(momspi16S_samp, order(subject_id, sample_body_site, visit_number)),
]
momspiCyto_samp <- momspiCyto_samp[
with(momspiCyto_samp, order(subject_id, sample_body_site, visit_number)),
]
#select data collected at the same visit
combined_samp <- merge(momspi16S_samp, momspiCyto_samp,
by = c("subject_id", "sample_body_site",
"project_name", "study_full_name",
"subject_gender", "subject_race",
"visit_number"))
In this tutorial, we concentrate on cross-sectional analysis of the first visit data, which corresponds to earlier time during pregnancy.
#select data from first visit only
combined_samp <- combined_samp[combined_samp$visit_number == 1,]
table(combined_samp$sample_body_site)#all vaginal samples
#>
#> vagina
#> 225
We match the samples (contained in columns of both tables) by the file names contained in colnames of each table.
#> [1] 29 225
In ‘combined_samp’ object the names of matched files names for 16S data are recorded in column ‘file.x’ and for cytokines data in column ‘file.y’.
Make sure that samples are in rows and variables (taxa and cytokines) are in columns.
combined_16S_mtx <- t(combined_16S_mtx)
combined_Cyto_mtx <- t(combined_Cyto_mtx)
Taxa are converted to proportions.
combined_16S_mtx <- combined_16S_mtx/apply(combined_16S_mtx, 1, sum)
#> [1] TRUE
Co-inertia analysis
Basics:
- Let X and Y be 16S and cytokines tables respectively
- Rows: same n women at first visit
- Columns: p1 taxa and p2 cytokines
- Column weights: QX and QY
- Row weights: D
- PCA analysis of each table: (X, QX, D) and (Y, QY, D)
- Co-inertia axes: YTDX = KΛ1/2AT from eigendecomposition of (YTDX, QX, QY)
- Plot FX = XA and FY = YK
Taxa normalization:
- Center 16S data to work with PCA on the covariance matrix ΣX = Cov(X)
- To normalize the magnitude, divide each value of X by the total variance: $\sqrt{\mbox{tr}(\Sigma_X)}$
- Note: step 2 is quivalent to dividing the matrix by $\sqrt{\sum_{k=1}^r \lambda_k}$, where λk are the eigevalues of ΣX and r is the rank of X.
taxa_mtx <- scale(combined_16S_mtx, center = TRUE, scale = FALSE)
#use fast trace computation formula: tr(A^B) = sum(A*B), where '*' operator refers to elemetwise product
taxa_tr <- sum(taxa_mtx*taxa_mtx)/(dim(taxa_mtx)[1]-1)
taxa_mtx <- taxa_mtx/sqrt(taxa_tr)
taxa.pca <- ade4::dudi.pca(taxa_mtx, scannf=FALSE, nf =61,
center = FALSE, scale = FALSE)
Cytokines normalization:
- Center and scale cytokines data to work with PCA on the correlation matrix ΣX = Cor(Y)
- To normalize the magnitude, divide each value of Y by the total variance: $\sqrt{\mbox{tr}(\Sigma_Y)}$
cyto_mtx <- scale(combined_Cyto_mtx, center = TRUE, scale = TRUE)
cyto_tr <- sum(cyto_mtx*cyto_mtx)/(dim(cyto_mtx)[1]-1)
cyto_mtx <- cyto_mtx/sqrt(cyto_tr)
cyto.pca <- ade4::dudi.pca(cyto_mtx, scannf=FALSE, nf =61,
center = FALSE, scale = FALSE)
Combine the tables using co-inertia
Co-inertia is available through R package ade4. It takes ade4 PCA objects and performes joint eigendecomposition.
coin <- ade4::coinertia(taxa.pca, cyto.pca, scannf = FALSE, nf = 2)
RV coefficient – measure of similarity between 16S and cytokines tables
RV <- coin$RV
RV
#> [1] 0.0396224
Visualization of output
Interpretation of the plots is similar to examining PCA results.
Plot below provides variables projection on common co-inertia axes.
Variables projected close to each other reflect similarilty within and across two data sets
Variables with larger values on each component (x- and y-axes) have more importance
We start by identifying important taxa (larger component values):
- visually from the plot, or
taxa.inx <- c(6, 19, 24, 28, 31, 68, 69, 396)
- by specifying x and y coordinates to pull all variables with coordinates larger (or smaller) than these values. Table 1 (taxa) values can be accessed $co object of coinertia output named coin.
kable(head(coin$co),
digits = 5)%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Comp1 | Comp2 | |
---|---|---|
807795 | 0.00002 | 0.00002 |
134726 | -0.00006 | 0.00003 |
215097 | 0.00000 | 0.00000 |
542066 | -0.00001 | -0.00004 |
851634 | 0.00001 | 0.00000 |
354905 | 0.01164 | -0.00768 |
Similarly, table 2 (here cytokines) variables loadings can be extracted from the $li object of coinertia output named coin.
kable(head(coin$li),
digits = 5)%>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Axis1 | Axis2 | |
---|---|---|
Eotaxin | -0.01231 | -0.00518 |
FGF | -0.00549 | -0.00471 |
G-CSF | 0.00815 | -0.01028 |
GM-CSF | 0.01838 | -0.00531 |
IFN-g | -0.00426 | -0.00404 |
IL-10 | -0.00591 | -0.00791 |
Greengene ID | Plot ID | Family | Genus | Species |
---|---|---|---|---|
134467 | 19 | Lactobacillaceae | Lactobacillus | NA |
137183 | 28 | Bifidobacteriaceae | Gardnerella | NA |
137580 | 24 | Lactobacillaceae | Lactobacillus | NA |
29566 | 69 | Leptotrichiaceae | Sneathia | NA |
309133 | 396 | Enterococcaceae | NA | NA |
332718 | 31 | Lactobacillaceae | Lactobacillus | NA |
354905 | 6 | Lactobacillaceae | Lactobacillus | NA |
851726 | 68 | Veillonellaceae | Megasphaera | NA |
Sample scores plots. Length of the arrows indicates the samples that have larger differences across two data sets.
#> Warning: Ignoring unknown aesthetics: na.rm
#> Warning: Ignoring unknown aesthetics: na.rm
Samples with largest difference across two data sets. Samples with arrow lengths in 0.9 quantile are chosen.
#Taxa with major differences across two sets
large.difs <- rownames(Samp.coin$Dissimilarity[Samp.coin$Dissimilarity$Quantile >= 0.9, ])
Samples with largest differences across two data sets.
small.difs <- rownames(Samp.coin$Dissimilarity[Samp.coin$Dissimilarity$Quantile < 0.1, ])
Examine these taxa and cytokines
Morgan, Xochitl C, and Curtis Huttenhower. 2012. “Chapter 12: Human Microbiome Analysis.” PLoS Computational Biology 8 (12): e1002808. https://doi.org/10.1371/journal.pcbi.1002808.
Pasolli, Edoardo, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, et al. 2017. “Accessible, Curated Metagenomic Data Through ExperimentHub.” Nature Methods 14 (11). Nature Research: 1023–4. https://doi.org/10.1038/nmeth.4468.
Schiffer, Lucas, Rimsha Azhar, Lori Shepherd, Marcel Ramos, Ludwig Geistlinger, Curtis Huttenhower, Jennifer B Dowd, Nicola Segata, and Levi Waldron. 2019. “HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor.” American Journal of Epidemiology, January. https://doi.org/10.1093/aje/kwz006.