Working with open-source Human Microbiome Project Data phases 1 (HMP) and 2 (iHMP): Efficient Data Access and Analysis Workflow

Syllabus

Instructors names and contact information

Levi Waldron, Graduate School of Public Health and Health Policy, City University of New York, New York, NY. email: Levi.Waldron@sph.cuny.edu
Ni Zhao, Department of Biostatistics, Johns Hopkins University, Baltimore, MD. email: nzhao10@jhu.edu
Ekaterina Smirnova, Department of Biostatistics, Virginia Commonwealth University, Richmond, VA. email: ekaterina.smirnova@vcuhealth.org

Workshop Description

The composition of microbial species in a human body is essential for maintaining human health, and it is associated with a number of diseases including obesity, bowel inflammatory disease, and bacterial vaginosis. Over the last decade, microbiome data analysis almost entirely shifted towards using samples taken directly from various sites of human body and to explore a large number of microbes using 16S or whole metagenome sequencing. This technological shift has led to a radical change in the data collected and led to the creation of the Human Microbiome Project (HMP) in 2008. With the growth and success of the microbiomics field, the size and complexity, and availability of the microbiome data in any given experiment have increased exponentially. Publicly-available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis.

The main goal of this workshop is to provide a much-needed tutorial on downloading, understanding, and analyzing the publicly-available HMP data. We will describe the two packages, HMP16SData and HMP2Data, that provide the comprehensive tutorial on the analysis of Phase 1 and 2 of the HMP project, respectively. These packages provide count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. As such, these packages provide the first comprehensive publicly-available tool for working with microbiome data collected at different body sites (e.g., fecal, nasal, vaginal) coupled with other omics approaches (e.g., cytokines, whole metagenome), and is conducted longitudinally on multiple subjects at multiple visits. By removing bioinformatic hurdles of data access and management, HMP16SData and HMP2Data enable researchers with only basic R skills to quickly analyze HMP data. This allows studying the dynamics of microbial composition over the disease progression, discovering disease-specific microbial biomarkers, and understanding the role of microbiome in connection to other omics measurements.

This workshop will consist of the series on the instructors-led live demos presented together with the brief lecture materials and statistical methods review when necessary. Completely worked out examples are available through the HMP16SData and HMP2Data package vignettes.

Pre-requisites

Basic knowledge of R syntax
Familiarity with the microbiome data
Familiarity with alpha and beta diversity in the context of microbiome data analysis
Familiarity with principal components and principal coordinates analysis

Background reading:

Schiffer L, Azhar R, Shepherd L, Ramos M, Geistlinger L, Huttenhower C, Dowd JB, Segata N, Waldron L (2019). “HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor.” American Journal of Epidemiology. doi: 10.1093/aje/kwz006.

Time outline

Activity	Time
HMP16SData	10m
HMP2Data	10m
Microbiome Analyses	15m
Multi-omics data integration	15m

Workshop goals and objectives

The main purpose of this workshop is to help researchers interested in public microbiome data to start using the HMP16SData and HMP2Data packages.

Learning goals

understand the goals of HMP projects and the studies available through the HMP DAC portal
download and use HMP data packages
understand the difference between phase 1 and phase 2 projects
understand principles of -omics data integration and visualization

Learning objectives

install HMP data packages and access vignettes
download HMP data to produce analyses illustrated in package vignettes
create sample summary statistics
create alpha and beta diversity plots
understand longitudinal patterns and examine microbiome diversity changes
integrate 16S and cytokines data using co-inertia analyses techniques
learn capturing patterns across -omics data sets of different structure

Setup

Installing all packages necessary for this workshop can be accomplished by the following command:

BiocManager::install("waldronlab/MicrobiomeWorkshop", build_vignettes=TRUE, 
        dependencies=TRUE)

Load the workshop package:

library(MicrobiomeWorkshop)

Then you should be able to view the compiled vignette:

vignette("MicrobiomeWorkshop")

Note, all packages are listed in the DESCRIPTION file of this workshop.

Also some utility packages:

suppressPackageStartupMessages({
  library(kableExtra)
  library(magrittr)
  library(reshape2)
  library(gridExtra)
})

Source some miscellaneous R scripts for HMP2Data component:

source(system.file(package='MicrobiomeWorkshop', 'vignettes', "CIAPlots.R"))
source(system.file(package='MicrobiomeWorkshop', 'vignettes', "ScreePlot.R"))

Introduction

Bioconductor provides curated resources of microbiome data. Most microbiome data are generated either by targeted amplicon sequencing (usually of variable regions of the 16S ribosomal RNA gene) or by metagenomic shotgun sequencing (MGX). These two approaches are analyzed by different sequence analysis tools, but downstream statistical and ecological analysis can involve any of the following types of data:

taxonomic abundance at different levels of the taxonomic hierarchy
phylogenetic distances and the phylogenetic tree of life
metabolic potential of the microbiome
abundance of microbial genes and gene families

A review of types and properties of microbiome data is provided by (Morgan and Huttenhower 2012).

curatedMetagenomicData: Curated and processed metagenomic data through ExperimentHub

curatedMetagenomicData(Pasolli et al. 2017) provides 6 types of processed data for >30 publicly available whole-metagenome shotgun sequencing datasets, including from the Human Microbiome Project Phase 1:

Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
Presence of unique, clade-specific markers
Abundance of unique, clade-specific markers
Abundance of gene families
Metabolic pathway coverage
Metabolic pathway abundance

Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.

Currently, curatedMetagenomicData provides:

8184 samples from 46 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor ExpressionSet objects
~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
Processing of data through the MetaPhlAn2 pipeline for taxonomic abundance, and HUMAnN2 pipeline for metabolic analysis
These represent ~100TB of raw sequencing data, but the processed data provided are much smaller.

These datasets are documented in the reference manual.

This is an ExperimentHub package, and its main workhorse function is curatedMetagenomicData():

The manually curated metadata for all available samples are provided in a single table combined_metadata:

library(curatedMetagenomicData)
?combined_metadata
View(data.frame(combined_metadata))

The main function provides a list of ExpressionSet objects:

oral <- c("HMP_2012.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
#> Working on HMP_2012.metaphlan_bugs_list.oralcavity
#> snapshotDate(): 2019-06-20
#> see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache 
#>     'EH2491 : 2491'

esl
#> List of length 1
#> names(1): HMP_2012.metaphlan_bugs_list.oralcavity

These ExpressionSet objects can also be converted to phyloseq object for ecological analysis and differential abundance analysis using the DESeq2 package, using the ExpressionSet2phyloseq() function:

ExpressionSet2phyloseq( esl[[1]], phylogenetictree = TRUE)
#> Loading required namespace: phyloseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 600 taxa and 415 samples ]
#> sample_data() Sample Data:       [ 415 samples by 18 sample variables ]
#> tax_table()   Taxonomy Table:    [ 600 taxa by 8 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 600 tips and 599 internal nodes ]

See the documentation of phyloseq for more on ecological and differential abundance analysis of the microbiome.

HMP16SData: 16S rRNA Sequencing Data from the Human Microbiome Project

suppressPackageStartupMessages(library(HMP16SData))
#> snapshotDate(): 2019-06-20

HMP16SData(Schiffer et al. 2019) is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data. Taxonomic count data files are provided as downloaded from the HMP Data Analysis and Coordination Center from its QIIME pipeline. Processed data is provided as SummarizedExperiment class objects via ExperimentHub. Like other ExperimentHub-based packages, a convenience function does downloading, automatic local caching, and serializing of a Bioconductor data class. This returns taxonomic counts from the V1-3 variable region of the 16S rRNA gene, along with the unrestricted participant data and phylogenetic tree.

V13()
#> snapshotDate(): 2019-06-20
#> see ?HMP16SData and browseVignettes('HMP16SData') for documentation
#> downloading 0 resources
#> loading from cache 
#>     'EH1117 : 1117'
#> class: SummarizedExperiment 
#> dim: 43140 2898 
#> metadata(2): experimentData phylogeneticTree
#> assays(1): 16SrRNA
#> rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
#> rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
#> colnames(2898): 700013549 700014386 ... 700114963 700114965
#> colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID

This can also be converted to phyloseq for ecological and differential abundance analysis; see the HMP16SData vignette for details.

HMP2Data

Integrative Human Microbiome Project (iHMP) https://hmpdacc.org/ihmp/

One of the largest open data resources for studying longitudinal microbiome changes and connection to disease
Novel data – first scientific results just published in the special collection of Nature and its sister journals (https://www.nature.com/articles/s41586-019-1238-8)

Human Microbiome Project Data Portal (https://portal.hmpdacc.org/)
Open data – but not straightforward to download and work with

Need Aspera client to even download the data – not every lab/researcher has these expertise available
Still need multiple processing steps after data is downloaded to:

map taxonomy ids from the database
merge with meta data – not available on the DAC portal
construct phylogenetic tree
merge omics data.

HMP2Data package

The following packages will be used in this vignette to provide demonstrative examples of what a user might do with HMP2Data.

suppressPackageStartupMessages({
  library(phyloseq)
  library(SummarizedExperiment)
  library(MultiAssayExperiment)
  library(dplyr)
  library(ggplot2)
  library(UpSetR)
  library(ade4)
  library(vegan)
  library(MiRKAT)
})

HMP2Data package is currently under review as a Bioconductor package. Current (development) version can be installed can be installed from John Stansfield’s GitHub repository https://github.com/jstansfield0/HMP2Data as follows. This is the version used for the workshop.

BiocManager::install("jstansfield0/HMP2Data")

The stable version is available at the dozmorovlab GitHub repository https://github.com/dozmorovlab/HMP2Data

BiocManager::install("dozmorovlab/HMP2Data")

Load HMP2Data package

library(HMP2Data)

Once installed, HMP2Data provides functions to access the data from the three major HMP2 studies.

Multi-omics microbiome study pregnancy initiavite (MOMS-PI)

Vaginal microbiome consortium study at Virginia Commonwealth University (VCU) http://vmc.vcu.edu/momspi

Pre-term birth study results: https://www.nature.com/articles/s41591-019-0450-2
Trem-birth study: https://www.nature.com/articles/s41591-019-0465-8
Only a subset of samples available on the DAC portal was used in current publications
Novel longitudinal and multi-omics models have not yet been throughly explored
Data structure gives many opportunities for statistical models development and data exploration

16S data

Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:

data("momspi16S_mtx")

Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:

data("momspi16S_tax")
# Check if Greengene IDs match between the 16S and taxonomy data
# all.equal(rownames(momspi16S_mtx), rownames(momspi16S_tax)) # Should be TRUE

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("momspi16S_samp")

# Check if sample names match between the 16S and sample data
# all.equal(colnames(momspi16S_mtx), rownames(momspi16S_samp)) # Should be TRUE

The momspi16S function assembles those matrices into a phyloseq object.

momspi16S_phyloseq <- momspi16S()
momspi16S_phyloseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 7665 taxa and 9107 samples ]
#> sample_data() Sample Data:       [ 9107 samples by 13 sample variables ]
#> tax_table()   Taxonomy Table:    [ 7665 taxa by 7 taxonomic ranks ]

Cytokine data

The MOMS-PI cytokine data can be loaded as a matrix, rownames are cytokine names, colnames are sample names:

data("momspiCyto_mtx")
momspiCyto_mtx[1:5, 1:3]
#>         EP004835_K10_MVAX EP004835_K20_MVAX EP004835_K30_MVAX
#> Eotaxin             68.05             60.08             37.84
#> FGF                265.45             -2.00             -2.00
#> G-CSF              624.79           2470.70             -2.00
#> GM-CSF              51.39             -2.00             42.01
#> IFN-g              164.58            193.22            184.53

The cytokine data has 29 rows and 1396 columns.

Load the cytokine sample annotation data as a matrix, rows are samples, columns are annotations:

data("momspiCyto_samp")
colnames(momspiCyto_samp)
#>  [1] "file_id"          "md5"              "size"            
#>  [4] "urls"             "sample_id"        "file_name"       
#>  [7] "subject_id"       "sample_body_site" "visit_number"    
#> [10] "subject_gender"   "subject_race"     "study_full_name" 
#> [13] "project_name"

The function momspiCytokines will make a SummarizedExperiment containing cytokine data

momspiCyto <- momspiCytokines()
momspiCyto
#> class: SummarizedExperiment 
#> dim: 29 1396 
#> metadata(0):
#> assays(1): ''
#> rownames(29): Eotaxin FGF ... FGF basic IL-17
#> rowData names(1): cytokine
#> colnames(1396): EP004835_K10_MVAX EP004835_K20_MVAX ...
#>   EP996091_K40_MVAX EP996091_K60_MVAX
#> colData names(13): file_id md5 ... study_full_name project_name

IBD

Load 16S data as a matrix, rows are SILVA IDs, columns are sample names:

data("IBD16S_mtx")

Load the SILVA taxonomy table as a matrix, rows are SILVA IDs, columns are taxonomic ranks:

data("IBD16S_tax")
colnames(IBD16S_tax)
#> [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("IBD16S_samp")
colnames(IBD16S_samp) %>% head()
#> [1] "Project"       "sample_id"     "subject_id"    "site_sub_coll"
#> [5] "data_type"     "week_num"

The IBD phyloseq object can be loaded as follows.

IBD <- IBD16S()
IBD
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 982 taxa and 178 samples ]
#> sample_data() Sample Data:       [ 178 samples by 490 sample variables ]
#> tax_table()   Taxonomy Table:    [ 982 taxa by 6 taxonomic ranks ]

T2D

Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:

data("T2D16S_mtx")

Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:

data("T2D16S_tax")
colnames(T2D16S_tax)
#> [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("T2D16S_samp")
colnames(T2D16S_samp)
#>  [1] "file_id"          "md5"              "size"            
#>  [4] "urls"             "sample_id"        "file_name"       
#>  [7] "subject_id"       "sample_body_site" "visit_number"    
#> [10] "subject_gender"   "subject_race"     "study_full_name" 
#> [13] "project_name"

The T2D phyloseq object can be loaded like so.

T2D <- T2D16S()
T2D
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 12062 taxa and 2208 samples ]
#> sample_data() Sample Data:       [ 2208 samples by 13 sample variables ]
#> tax_table()   Taxonomy Table:    [ 12062 taxa by 7 taxonomic ranks ]

Features

Frequency Table Generation

The sample-level annotation data contianed in HMP2Data can be summarized using the table_two function. First, we need to make a list of the phyloseq and SummarizedExperiment objects which can then be entered into the table_two() table generating function.

list("MOMS-PI 16S" = momspi16S_phyloseq, "MOMS-PI Cytokines" = momspiCyto, "IBD 16S" = IBD, "T2D 16S" = T2D) %>% table_two()

	MOMS-PI 16S		MOMS-PI Cytokines		IBD 16S		T2D 16S
	N	%	N	%	N	%	N	%
Body Site
buccal mucosa	3313	36.38	311	22.28	0	0	0	0
cervix of uterus	162	1.78	0	0	0	0	0	0
feces	765	8.4	0	0	178	100	1041	47.15
rectum	2679	29.42	0	0	0	0	0	0
unknown	146	1.6	0	0	0	0	0	0
vagina	2042	22.42	979	70.13	0	0	0	0
blood cell	0	0	106	7.59	0	0	0	0
nasal cavity	0	0	0	0	0	0	1167	52.85
Sex
male	0	0	0	0	84	47.19	1248	56.52
female	9107	100	1396	100	94	52.81	947	42.89
unknown	0	0	0	0	0	0	13	0.59
Race
african american	0	0	0	0	5	2.81	117	5.3
asian	0	0	0	0	0	0	235	10.64
caucasian	0	0	0	0	164	92.13	1657	75.05
ethnic other	0	0	0	0	8	4.49	73	3.31
hispanic or latino	0	0	0	0	0	0	126	5.71
american indian or alaska native	0	0	0	0	1	0.56	0	0
unknown	9107	100	1396	100	0	0	0	0
total samples	9107	100	1396	100	178	100	2208	100

Visits Histograms

We can plot the histogram of the number of samples at each visit for all studies. Note that the X-axis is capped at count of 40 for better visualization.

# make data.frame for plotting
plot_visits <- data.frame(study = c(rep("MOMS-PI Cytokines", nrow(momspiCyto_samp)),
                     rep("IBD 16S", nrow(IBD16S_samp)),
                     rep("T2D 16S", nrow(T2D16S_samp))),
          visits = c(momspiCyto_samp$visit_number,
                     IBD16S_samp$visit_number,
                     T2D16S_samp$visit_number))
p2 <- ggplot(plot_visits, aes(x = visits, fill = study)) + 
  geom_histogram(position = "dodge", alpha = 0.7, bins = 30, color = "#00BFC4") + xlim(c(0, 40)) +
  theme(legend.position = c(0.8, 0.8))  + 
  xlab("Visit number") + ylab("Count")
p2
#> Warning: Removed 677 rows containing non-finite values (stat_bin).
#> Warning: Removed 4 rows containing missing values (geom_bar).

Note that there are 677 samples at visit numbers over 40.

UpsetR plots

MOMS-PI 16S rRNA data

Here we plot the body sites which samples were taken from for each patient in the MOMS-PI 16S data.

# set up data.frame for UpSetR
momspi_upset <- aggregate(momspi16S_samp$sample_body_site, by = list(momspi16S_samp$subject_id), table)
tmp <- as.matrix(momspi_upset[, -1])
tmp <- (tmp > 0) *1
momspi_upset <- data.frame(patient = momspi_upset$Group.1, tmp)

# plot
upset(momspi_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

MOMS-PI Cytokines data

Here we plot the body sites which samples were taken from for each patient in the MOMS-PI cytokines data.

# set up data.frame for UpSetR
momspiCyto_upset <- aggregate(momspiCyto_samp$sample_body_site, by = list(momspiCyto_samp$subject_id), table)
tmp <- as.matrix(momspiCyto_upset[, -1])
tmp <- (tmp > 0) *1
momspiCyto_upset <- data.frame(patient = momspiCyto_upset$Group.1, tmp)

# plot
upset(momspiCyto_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

IBD data

The IBD patients only had samples taken from the feces.

T2D data

Here we plot the body sites which samples were taken from for each patient in the T2D 16S rRNA data.

# set up data.frame for UpSetR
T2D_upset <- aggregate(T2D16S_samp$sample_body_site, by = list(T2D16S_samp$subject_id), table)
tmp <- as.matrix(T2D_upset[, -1])
tmp <- (tmp > 0) *1
T2D_upset <- data.frame(patient = T2D_upset$Group.1, tmp)

# plot
upset(T2D_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

Microbiome Analyses: IBD example


Table1Var <- c("subject_gender", "Age.at.diagnosis", "race", "visit_number",
               "site_name", "Education.Level", "Antibiotics", "BMI")

demo_data = sample_data(IBD)[ ,c("sample_id", "subject_id", Table1Var, "diagnosis")] %>%
            data.frame()

#all <- CreateTableOne(vars = Table1Var, data = demo_data)
#kableone(all)

stratified = tableone::CreateTableOne(
  vars = Table1Var,
  data = summarytools::unlabel(demo_data), strata = "diagnosis", includeNA = TRUE)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> Warning in ModuleReturnVarsExist(vars, data): These variables only have NA/
#> NaN: BMI Dropped
#> Warning in min(x, na.rm = TRUE): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf
#> Warning in StdDiff(variable = var, group = strataVar): Variable has only
#> NA's in at least one stratum. na.rm turned off.
stratified <- print(stratified, printToggle = FALSE, showAllLevels = FALSE)
stratified[,!(colnames(stratified) %in% "test")]%>%
  knitr::kable(format = "html", caption = "Characteristics of the Study Cohort",
               col.names = c("CD", "nonIBD", 
                             "UC", "p-value"))%>% 
  kable_styling("striped", full_width = T)

Table 1: Characteristics of the Study Cohort
	CD	nonIBD	UC	p-value
n	86	46	46
subject_gender = male (%)	43 ( 50.0)	24 ( 52.2)	17 ( 37.0)	0.264
Age.at.diagnosis (mean (SD))	21.34 (11.31)	NaN (NA)	25.65 (15.21)	0.067
race (%)				0.002
American Indian or Alaska Native	1 ( 1.2)	0 ( 0.0)	0 ( 0.0)
Black or African American	0 ( 0.0)	0 ( 0.0)	5 ( 10.9)
More than one race	0 ( 0.0)	2 ( 4.3)	2 ( 4.3)
Other	4 ( 4.7)	0 ( 0.0)	0 ( 0.0)
White	81 ( 94.2)	44 ( 95.7)	39 ( 84.8)
visit_number (mean (SD))	5.33 (10.80)	2.35 (6.39)	1.67 (4.57)	0.034
site_name (%)				<0.001
Cedars-Sinai	21 ( 24.4)	2 ( 4.3)	23 ( 50.0)
Cincinnati	28 ( 32.6)	20 ( 43.5)	11 ( 23.9)
MGH	24 ( 27.9)	18 ( 39.1)	6 ( 13.0)
MGH Pediatrics	13 ( 15.1)	6 ( 13.0)	6 ( 13.0)
Education.Level (%)				0.001
7th grade or less	14 ( 16.3)	18 ( 39.1)	7 ( 15.2)
Bachelor’s degree	9 ( 10.5)	10 ( 21.7)	4 ( 8.7)
High school graduate or GED	5 ( 5.8)	2 ( 4.3)	0 ( 0.0)
Master’s degree	0 ( 0.0)	2 ( 4.3)	4 ( 8.7)
Professional/Doctoral degree	6 ( 7.0)	2 ( 4.3)	4 ( 8.7)
Some college, no degree	12 ( 14.0)	0 ( 0.0)	7 ( 15.2)
Some high school	27 ( 31.4)	8 ( 17.4)	10 ( 21.7)
Unknown/Not Reported	11 ( 12.8)	2 ( 4.3)	10 ( 21.7)
NA	2 ( 2.3)	2 ( 4.3)	0 ( 0.0)
Antibiotics = No (%)	86 (100.0)	46 (100.0)	46 (100.0)	NA

###Alpha diversity analysis

Alpha Diversity Definitions

Alpha diversity is a within-sample diversity measurement that models the richness and/or eveness of the microbial community. Different alpha diversities emphasize different aspect of the community. Commonly used alpha diversities include:

Diversity	Description	Formula
Observed Species	No. of unique taxa in a sample	Number of species
Chao 1	Adding a correction to observed species	$S_{ob} + \frac{n_1^2}{2n_2}$
Shannon	Both richness and evenness	$-\sum_{i=1}^{s} p_i \log(p_i)$
Simpson reciprocal	Both richness and evenness	$\frac{N(N - 1)}{n(n-1)}$
Pielou’s evenness	evenness	$\frac{H}{H_{max}}$

Of course, the alpha diversity measurement is impacted by the sequencing depth (total number of reads per sample). Rarefying (through vegan:rarefy) is necessary before calculating alpha diversity.

Comparing Alpha Diversities between Groups

We can use “plot_richness” from phyloseq package to plot the alpha diversities.

IBD %<>%
  taxa_sums() %>%
  is_greater_than(0) %>%
  prune_taxa(IBD)
sample_data(IBD)$diagnosis = factor(sample_data(IBD)$diagnosis)

p = plot_richness(IBD, x="diagnosis", color="diagnosis", measures=c("Observed", "InvSimpson", "Shannon", "Chao1")) + geom_jitter()
p + geom_boxplot(data = p$data, aes(x = diagnosis, y = value, color = NULL), alpha = 0.1)
#> Warning: Removed 534 rows containing missing values (geom_errorbar).

Once we summarize the whole community through a one-dimensional summary statistics (alpha diversity), all the traditional methods for hypothesis testing are applicable.

alpha = estimate_richness(IBD, measures=c("Observed", "InvSimpson", "Shannon"))
alpha_df = data.frame(sample_data(IBD)[,c("sample_id","subject_id")], alpha)
rownames(alpha_df) = NULL
kable(head(alpha_df),
      digits = 3,caption = "Some Alpha Diversities of the IBD dataset") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F,position = "center")

Table 2: Some Alpha Diversities of the IBD dataset
sample_id	subject_id	Observed	Shannon	InvSimpson
206534	M2008	112	2.572	5.590
206536	M2008	96	2.534	5.565
206538	M2008	80	2.655	6.452
206547	M2014	77	2.733	7.030
206548	M2014	69	2.632	6.556
206561	M2021	63	3.357	16.324

Beta diversity analysis

Beta Diversity Definitions

Beta diversity describes how samples vary against each other. Beta diversity is typically the thinking behind “clustering” algorithms that show differences or similarities among samples. For example, we may be intrested in the differences in gut microbiome between non IBD and IBD patients.

List of the commonly used beta-diversities:

Diversity	Description	Formula
UniFrac	Qualitative, Phylogenetics-based	$\sum_{i=1}^n\frac{b_i\|I(p_i^A > 0)-I(p_i^B > 0)\|}{\sum_{i=1}^n b_i}$
Weighted UniFrac	Quantitative,Phylogenetics-based	$\frac{\sum_{i=1}^n b_i\|p_i^A - p_i^B\|}{\sum_{i=1}^nb_i(p_i^A + p_i^B)}$
Generalized UniFrac	Compromise between the previous two	$\frac{\sum_{i=1}^n b_i (p_i^A + p_i^B)^{\alpha} \|\frac{p_i^A - p_i^B}{p_i^A + p_i^B}\|} {\sum_{i=1}^nb_i(p_i^A + p_i^B)^{\alpha}}$
Bray-Curtis	Quantitative	$1 - \frac{2C_{i,j}}{S_i + S_j}$
Jaccard	Qualitative	$\frac{\|A \cap B\|}{\|A \cup B\|}$

Comparing Beta Diversities between Groups


par(mfrow = c(1,2))
jac <- as.matrix(distance(t(otu_table(IBD)), method = "jaccard"))
mod1 <- vegan::betadisper(as.dist(jac), as.factor(IBD16S_samp$diagnosis))
plot(mod1, ellipse = TRUE, hull = FALSE, main="Jaccard Distance", xlab="PC 1", ylab="PC 2", sub=NULL)
box()

bc = as.matrix(ecodist::bcdist(t(otu_table(IBD))))
mod2 <- vegan::betadisper(as.dist(bc), as.factor(IBD16S_samp$diagnosis))
plot(mod2, ellipse = TRUE, hull = FALSE, main="Bray-Curtis Dissimilarity", xlab="PC 1", ylab="PC 2", sub=NULL, xlim = c(-0.2, 0.2))
legend("bottomright", legend=c("CD", "nonIBD", "UC"), pch = 1:3, col=c("Black", "red", "Green"), cex=1.3, box.lty=0)
box()

#####Permutational Multivariate Analysis of Variance

Permutational multivariate analysis of variance (PERMANOVA) (???), a non-parametric multivariate statistical test, is one commonly used for assessing the association between microbiome community and a phenotype (e.g., IBD or not).

However, the PERMANOVA approach, by itself, doesn’t allow for repeated measures. Ignoring the repeated measures (treating them as if they are all independent samples) can lead to inflated type I error (i.e., more false positive discoveries than we would like to accept through our p-value cutoff). However, we can use the method if we consider a subset of the samples which are independent with each other, such as those from the baseline.

PERMANOVA requires the selection of a single distance metric. Howeve, if the selected distance metric doesn’t capture the underlying association pattern, PERMANOVA tends to lose power.

IBD0 = subset_samples(IBD,  week_num == 0&diagnosis %in% c("nonIBD", "CD"))
IBD0 %<>%
  taxa_sums() %>%
  is_greater_than(0) %>%
  prune_taxa(IBD0)

jac0 <- as.matrix(distance(t(otu_table(IBD0)), method = "jaccard"))
bc0 =  as.matrix(ecodist::bcdist(t(otu_table(IBD0))))
CD = as.numeric(sample_data(IBD0)$diagnosis == "CD")
gender = as.numeric(factor(sample_data(IBD0)$subject_gender))

fit = vegan::adonis(bc0 ~ gender + CD, permutations=500)
kable(fit$aov.tab,
      digits = 3,caption = "PERMANOVA result using Bray-Curtis distance") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Table 3: PERMANOVA result using Bray-Curtis distance
	Df	SumsOfSqs	MeanSqs	F.Model	R2	Pr(>F)
gender	1	0.421	0.421	1.288	0.017	0.178
CD	1	0.740	0.740	2.260	0.030	0.010
Residuals	71	23.237	0.327	NA	0.952	NA
Total	73	24.398	NA	NA	1.000	NA

Microbiome Regression Kernel Association Test (MiRKAT)

MiRKAT: a regression approach for flexibly associating the microbial community with phenotype of interest (???). The method also has extensions that can handle longitudinal measures (???), multivariate (???) and survial outcomes (???).

g(y_i) = X_iβ + f(M_i) + ϵ_i, f∼ a reproducing kernel Hilbert space generated by function K. Equivalently, f ∼ N(0, τK), in which K is a similarity matrix that can be calculated through a distance metric.

Multiple kernels (distances) are allowed simulatenously in this framework, making this procedure robust to the underlying association patterns.

 K_bc = MiRKAT::D2K(bc0); K_jac = D2K(jac0)
 y = ifelse(sample_data(IBD0)$diagnosis == "CD",1, 0 )
 out = MiRKAT::MiRKAT(y=y, X = gender, Ks = list(K_bc, K_jac), out_type = "D")

Multi-omics data integration: MOMS-PI example

Combine 16S and cytokines data

#order both sets by visit number within a subject
momspi16S_samp <- momspi16S_samp[
  with(momspi16S_samp, order(subject_id, sample_body_site, visit_number)),
] 

momspiCyto_samp <- momspiCyto_samp[
  with(momspiCyto_samp, order(subject_id, sample_body_site, visit_number)),
]

#select data collected at the same visit
combined_samp <- merge(momspi16S_samp, momspiCyto_samp, 
                       by = c("subject_id", "sample_body_site", 
                        "project_name", "study_full_name",
                        "subject_gender", "subject_race",
                        "visit_number"))

In this tutorial, we concentrate on cross-sectional analysis of the first visit data, which corresponds to earlier time during pregnancy.

#select data from first visit only
combined_samp <- combined_samp[combined_samp$visit_number ==  1,]

table(combined_samp$sample_body_site)#all vaginal samples
#> 
#> vagina 
#>    225

We match the samples (contained in columns of both tables) by the file names contained in colnames of each table.

#> [1]  29 225

In ‘combined_samp’ object the names of matched files names for 16S data are recorded in column ‘file.x’ and for cytokines data in column ‘file.y’.

Make sure that samples are in rows and variables (taxa and cytokines) are in columns.

combined_16S_mtx <- t(combined_16S_mtx)
combined_Cyto_mtx <- t(combined_Cyto_mtx)

Taxa are converted to proportions.

combined_16S_mtx <- combined_16S_mtx/apply(combined_16S_mtx, 1, sum)

#> [1] TRUE

Co-inertia analysis

Basics:

Let X and Y be 16S and cytokines tables respectively
Rows: same n women at first visit
Columns: p₁ taxa and p₂ cytokines
Column weights: Q_X and Q_Y
Row weights: D
PCA analysis of each table: (X, Q_X, D) and (Y, Q_Y, D)
Co-inertia axes: Y^TDX = KΛ^1/2A^T from eigendecomposition of (Y^TDX, Q_X, Q_Y)
Plot F_X = XA and F_Y = YK

Taxa normalization:

Center 16S data to work with PCA on the covariance matrix Σ_X = Cov(X)
To normalize the magnitude, divide each value of X by the total variance: $\sqrt{\mbox{tr}(\Sigma_X)}$
Note: step 2 is quivalent to dividing the matrix by $\sqrt{\sum_{k=1}^r \lambda_k}$, where λ_k are the eigevalues of Σ_X and r is the rank of X.

taxa_mtx <- scale(combined_16S_mtx, center = TRUE, scale = FALSE)
#use fast trace computation formula: tr(A^B) = sum(A*B), where '*' operator refers to elemetwise product
taxa_tr <- sum(taxa_mtx*taxa_mtx)/(dim(taxa_mtx)[1]-1)
taxa_mtx <- taxa_mtx/sqrt(taxa_tr)
taxa.pca <- ade4::dudi.pca(taxa_mtx, scannf=FALSE, nf =61,
                     center = FALSE, scale = FALSE)

Cytokines normalization:

Center and scale cytokines data to work with PCA on the correlation matrix Σ_X = Cor(Y)
To normalize the magnitude, divide each value of Y by the total variance: $\sqrt{\mbox{tr}(\Sigma_Y)}$

cyto_mtx <- scale(combined_Cyto_mtx, center = TRUE, scale = TRUE)
cyto_tr <- sum(cyto_mtx*cyto_mtx)/(dim(cyto_mtx)[1]-1)
cyto_mtx <- cyto_mtx/sqrt(cyto_tr)
cyto.pca <- ade4::dudi.pca(cyto_mtx, scannf=FALSE, nf =61,
                     center = FALSE, scale = FALSE)

Combine the tables using co-inertia

Co-inertia is available through R package ade4. It takes ade4 PCA objects and performes joint eigendecomposition.

coin <- ade4::coinertia(taxa.pca, cyto.pca, scannf = FALSE, nf = 2)

RV coefficient – measure of similarity between 16S and cytokines tables

RV <- coin$RV
RV
#> [1] 0.0396224

Visualization of output

Interpretation of the plots is similar to examining PCA results.

Plot below provides variables projection on common co-inertia axes.

Variables projected close to each other reflect similarilty within and across two data sets
Variables with larger values on each component (x- and y-axes) have more importance

We start by identifying important taxa (larger component values):

visually from the plot, or

taxa.inx <- c(6,  19,  24,  28,  31,  68,  69, 396)

by specifying x and y coordinates to pull all variables with coordinates larger (or smaller) than these values. Table 1 (taxa) values can be accessed $co object of coinertia output named coin.

kable(head(coin$co),
      digits = 5)%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

	Comp1	Comp2
807795	0.00002	0.00002
134726	-0.00006	0.00003
215097	0.00000	0.00000
542066	-0.00001	-0.00004
851634	0.00001	0.00000
354905	0.01164	-0.00768

Similarly, table 2 (here cytokines) variables loadings can be extracted from the $li object of coinertia output named coin.

kable(head(coin$li),
      digits = 5)%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

	Axis1	Axis2
Eotaxin	-0.01231	-0.00518
FGF	-0.00549	-0.00471
G-CSF	0.00815	-0.01028
GM-CSF	0.01838	-0.00531
IFN-g	-0.00426	-0.00404
IL-10	-0.00591	-0.00791

Table 4: Influential taxa
Greengene ID	Plot ID	Family	Genus	Species
134467	19	Lactobacillaceae	Lactobacillus	NA
137183	28	Bifidobacteriaceae	Gardnerella	NA
137580	24	Lactobacillaceae	Lactobacillus	NA
29566	69	Leptotrichiaceae	Sneathia	NA
309133	396	Enterococcaceae	NA	NA
332718	31	Lactobacillaceae	Lactobacillus	NA
354905	6	Lactobacillaceae	Lactobacillus	NA
851726	68	Veillonellaceae	Megasphaera	NA

Sample scores plots. Length of the arrows indicates the samples that have larger differences across two data sets.

#> Warning: Ignoring unknown aesthetics: na.rm

#> Warning: Ignoring unknown aesthetics: na.rm

Samples with largest difference across two data sets. Samples with arrow lengths in 0.9 quantile are chosen.

#Taxa with major differences across two sets
large.difs <- rownames(Samp.coin$Dissimilarity[Samp.coin$Dissimilarity$Quantile >= 0.9, ])

Samples with largest differences across two data sets.

small.difs <- rownames(Samp.coin$Dissimilarity[Samp.coin$Dissimilarity$Quantile < 0.1, ])

Examine these taxa and cytokines

Morgan, Xochitl C, and Curtis Huttenhower. 2012. “Chapter 12: Human Microbiome Analysis.” PLoS Computational Biology 8 (12): e1002808. https://doi.org/10.1371/journal.pcbi.1002808.

Pasolli, Edoardo, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, et al. 2017. “Accessible, Curated Metagenomic Data Through ExperimentHub.” Nature Methods 14 (11). Nature Research: 1023–4. https://doi.org/10.1038/nmeth.4468.

Schiffer, Lucas, Rimsha Azhar, Lori Shepherd, Marcel Ramos, Ludwig Geistlinger, Curtis Huttenhower, Jennifer B Dowd, Nicola Segata, and Levi Waldron. 2019. “HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor.” American Journal of Epidemiology, January. https://doi.org/10.1093/aje/kwz006.