Working with open-source Human Microbiome Project Data phases 1 (HMP) and 2 (iHMP): Efficient Data Access and Analysis Workflow

Syllabus

Instructors names and contact information

  1. Levi Waldron, Graduate School of Public Health and Health Policy, City University of New York, New York, NY. email:

  2. Ni Zhao, Department of Biostatistics, Johns Hopkins University, Baltimore, MD. email:

  3. Ekaterina Smirnova, Department of Biostatistics, Virginia Commonwealth University, Richmond, VA. email:

Workshop Description

The composition of microbial species in a human body is essential for maintaining human health, and it is associated with a number of diseases including obesity, bowel inflammatory disease, and bacterial vaginosis. Over the last decade, microbiome data analysis almost entirely shifted towards using samples taken directly from various sites of human body and to explore a large number of microbes using 16S or whole metagenome sequencing. This technological shift has led to a radical change in the data collected and led to the creation of the Human Microbiome Project (HMP) in 2008. With the growth and success of the microbiomics field, the size and complexity, and availability of the microbiome data in any given experiment have increased exponentially. Publicly-available data sets from amplicon sequencing of two 16S ribosomal RNA variable regions, with extensive controlled-access participant data, provide a reference for ongoing microbiome studies. However, utilization of these data sets can be hindered by the complex bioinformatic steps required to access, import, decrypt, and merge the various components in formats suitable for ecological and statistical analysis.

The main goal of this workshop is to provide a much-needed tutorial on downloading, understanding, and analyzing the publicly-available HMP data. We will describe the two packages, HMP16SData and HMP2Data, that provide the comprehensive tutorial on the analysis of Phase 1 and 2 of the HMP project, respectively. These packages provide count data for both 16S ribosomal RNA variable regions, integrated with phylogeny, taxonomy, public participant data, and controlled participant data for authorized researchers, using standard integrative Bioconductor data objects. As such, these packages provide the first comprehensive publicly-available tool for working with microbiome data collected at different body sites (e.g., fecal, nasal, vaginal) coupled with other omics approaches (e.g., cytokines, whole metagenome), and is conducted longitudinally on multiple subjects at multiple visits. By removing bioinformatic hurdles of data access and management, HMP16SData and HMP2Data enable researchers with only basic R skills to quickly analyze HMP data. This allows studying the dynamics of microbial composition over the disease progression, discovering disease-specific microbial biomarkers, and understanding the role of microbiome in connection to other omics measurements.

This workshop will consist of the series on the instructors-led live demos presented together with the brief lecture materials and statistical methods review when necessary. Completely worked out examples are available through the HMP16SData and HMP2Data package vignettes.

Pre-requisites

  • Basic knowledge of R syntax
  • Familiarity with the microbiome data
  • Familiarity with alpha and beta diversity in the context of microbiome data analysis
  • Familiarity with principal components and principal coordinates analysis

Background reading:

  • Schiffer L, Azhar R, Shepherd L, Ramos M, Geistlinger L, Huttenhower C, Dowd JB, Segata N, Waldron L (2019). “HMP16SData: Efficient Access to the Human Microbiome Project through Bioconductor.” American Journal of Epidemiology. doi: 10.1093/aje/kwz006.

Time outline

Activity Time
HMP16SData 10m
HMP2Data 10m
Microbiome Analyses 15m
Multi-omics data integration 15m

Workshop goals and objectives

The main purpose of this workshop is to help researchers interested in public microbiome data to start using the HMP16SData and HMP2Data packages.

Learning goals

  • understand the goals of HMP projects and the studies available through the HMP DAC portal
  • download and use HMP data packages
  • understand the difference between phase 1 and phase 2 projects
  • understand principles of -omics data integration and visualization

Learning objectives

  • install HMP data packages and access vignettes
  • download HMP data to produce analyses illustrated in package vignettes
  • create sample summary statistics
  • create alpha and beta diversity plots
  • understand longitudinal patterns and examine microbiome diversity changes
  • integrate 16S and cytokines data using co-inertia analyses techniques
  • learn capturing patterns across -omics data sets of different structure

Setup

Installing all packages necessary for this workshop can be accomplished by the following command:

BiocManager::install("waldronlab/MicrobiomeWorkshop", build_vignettes=TRUE, 
        dependencies=TRUE)

Load the workshop package:

library(MicrobiomeWorkshop)

Then you should be able to view the compiled vignette:

vignette("MicrobiomeWorkshop")

Note, all packages are listed in the DESCRIPTION file of this workshop.

Also some utility packages:

suppressPackageStartupMessages({
  library(kableExtra)
  library(magrittr)
  library(reshape2)
  library(gridExtra)
})

Source some miscellaneous R scripts for HMP2Data component:

source(system.file(package='MicrobiomeWorkshop', 'vignettes', "CIAPlots.R"))
source(system.file(package='MicrobiomeWorkshop', 'vignettes', "ScreePlot.R"))

Introduction

Bioconductor provides curated resources of microbiome data. Most microbiome data are generated either by targeted amplicon sequencing (usually of variable regions of the 16S ribosomal RNA gene) or by metagenomic shotgun sequencing (MGX). These two approaches are analyzed by different sequence analysis tools, but downstream statistical and ecological analysis can involve any of the following types of data:

  • taxonomic abundance at different levels of the taxonomic hierarchy
  • phylogenetic distances and the phylogenetic tree of life
  • metabolic potential of the microbiome
  • abundance of microbial genes and gene families

A review of types and properties of microbiome data is provided by (Morgan and Huttenhower 2012).

curatedMetagenomicData: Curated and processed metagenomic data through ExperimentHub

curatedMetagenomicData(Pasolli et al. 2017) provides 6 types of processed data for >30 publicly available whole-metagenome shotgun sequencing datasets, including from the Human Microbiome Project Phase 1:

  1. Species-level taxonomic profiles, expressed as relative abundance from kingdom to strain level
  2. Presence of unique, clade-specific markers
  3. Abundance of unique, clade-specific markers
  4. Abundance of gene families
  5. Metabolic pathway coverage
  6. Metabolic pathway abundance

Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.

Currently, curatedMetagenomicData provides:

  • 8184 samples from 46 datasets, primarily of the human gut but including body sites profiled in the Human Microbiome Project
  • Processed data from whole-metagenome shotgun metagenomics, with manually-curated metadata, as integrated and documented Bioconductor ExpressionSet objects
  • ~80 fields of specimen metadata from original papers, supplementary files, and websites, with manual curation to standardize annotations
  • Processing of data through the MetaPhlAn2 pipeline for taxonomic abundance, and HUMAnN2 pipeline for metabolic analysis
  • These represent ~100TB of raw sequencing data, but the processed data provided are much smaller.

These datasets are documented in the reference manual.

This is an ExperimentHub package, and its main workhorse function is curatedMetagenomicData():

The manually curated metadata for all available samples are provided in a single table combined_metadata:

library(curatedMetagenomicData)
?combined_metadata
View(data.frame(combined_metadata))

The main function provides a list of ExpressionSet objects:

oral <- c("HMP_2012.metaphlan_bugs_list.oralcavity")
esl <- curatedMetagenomicData(oral, dryrun = FALSE)
#> Working on HMP_2012.metaphlan_bugs_list.oralcavity
#> snapshotDate(): 2019-06-20
#> see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache 
#>     'EH2491 : 2491'
esl
#> List of length 1
#> names(1): HMP_2012.metaphlan_bugs_list.oralcavity

These ExpressionSet objects can also be converted to phyloseq object for ecological analysis and differential abundance analysis using the DESeq2 package, using the ExpressionSet2phyloseq() function:

ExpressionSet2phyloseq( esl[[1]], phylogenetictree = TRUE)
#> Loading required namespace: phyloseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 600 taxa and 415 samples ]
#> sample_data() Sample Data:       [ 415 samples by 18 sample variables ]
#> tax_table()   Taxonomy Table:    [ 600 taxa by 8 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 600 tips and 599 internal nodes ]

See the documentation of phyloseq for more on ecological and differential abundance analysis of the microbiome.

HMP16SData: 16S rRNA Sequencing Data from the Human Microbiome Project

suppressPackageStartupMessages(library(HMP16SData))
#> snapshotDate(): 2019-06-20

HMP16SData(Schiffer et al. 2019) is a Bioconductor ExperimentData package of the Human Microbiome Project (HMP) 16S rRNA sequencing data. Taxonomic count data files are provided as downloaded from the HMP Data Analysis and Coordination Center from its QIIME pipeline. Processed data is provided as SummarizedExperiment class objects via ExperimentHub. Like other ExperimentHub-based packages, a convenience function does downloading, automatic local caching, and serializing of a Bioconductor data class. This returns taxonomic counts from the V1-3 variable region of the 16S rRNA gene, along with the unrestricted participant data and phylogenetic tree.

V13()
#> snapshotDate(): 2019-06-20
#> see ?HMP16SData and browseVignettes('HMP16SData') for documentation
#> downloading 0 resources
#> loading from cache 
#>     'EH1117 : 1117'
#> class: SummarizedExperiment 
#> dim: 43140 2898 
#> metadata(2): experimentData phylogeneticTree
#> assays(1): 16SrRNA
#> rownames(43140): OTU_97.1 OTU_97.10 ... OTU_97.9997 OTU_97.9999
#> rowData names(7): CONSENSUS_LINEAGE SUPERKINGDOM ... FAMILY GENUS
#> colnames(2898): 700013549 700014386 ... 700114963 700114965
#> colData names(7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID

This can also be converted to phyloseq for ecological and differential abundance analysis; see the HMP16SData vignette for details.

HMP2Data

Integrative Human Microbiome Project (iHMP) https://hmpdacc.org/ihmp/

  • One of the largest open data resources for studying longitudinal microbiome changes and connection to disease

  • Novel data – first scientific results just published in the special collection of Nature and its sister journals (https://www.nature.com/articles/s41586-019-1238-8)

  • Need Aspera client to even download the data – not every lab/researcher has these expertise available
  • Still need multiple processing steps after data is downloaded to:
  1. map taxonomy ids from the database

  2. merge with meta data – not available on the DAC portal

  3. construct phylogenetic tree

  4. merge omics data.

HMP2Data package

The following packages will be used in this vignette to provide demonstrative examples of what a user might do with HMP2Data.

suppressPackageStartupMessages({
  library(phyloseq)
  library(SummarizedExperiment)
  library(MultiAssayExperiment)
  library(dplyr)
  library(ggplot2)
  library(UpSetR)
  library(ade4)
  library(vegan)
  library(MiRKAT)
})

HMP2Data package is currently under review as a Bioconductor package. Current (development) version can be installed can be installed from John Stansfield’s GitHub repository https://github.com/jstansfield0/HMP2Data as follows. This is the version used for the workshop.

BiocManager::install("jstansfield0/HMP2Data")

The stable version is available at the dozmorovlab GitHub repository https://github.com/dozmorovlab/HMP2Data

BiocManager::install("dozmorovlab/HMP2Data")

Load HMP2Data package

library(HMP2Data)

Once installed, HMP2Data provides functions to access the data from the three major HMP2 studies.

Multi-omics microbiome study pregnancy initiavite (MOMS-PI)

16S data

Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:

data("momspi16S_mtx")

Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:

data("momspi16S_tax")
# Check if Greengene IDs match between the 16S and taxonomy data
# all.equal(rownames(momspi16S_mtx), rownames(momspi16S_tax)) # Should be TRUE

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("momspi16S_samp")

# Check if sample names match between the 16S and sample data
# all.equal(colnames(momspi16S_mtx), rownames(momspi16S_samp)) # Should be TRUE

The momspi16S function assembles those matrices into a phyloseq object.

momspi16S_phyloseq <- momspi16S()
momspi16S_phyloseq
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 7665 taxa and 9107 samples ]
#> sample_data() Sample Data:       [ 9107 samples by 13 sample variables ]
#> tax_table()   Taxonomy Table:    [ 7665 taxa by 7 taxonomic ranks ]
Cytokine data

The MOMS-PI cytokine data can be loaded as a matrix, rownames are cytokine names, colnames are sample names:

data("momspiCyto_mtx")
momspiCyto_mtx[1:5, 1:3]
#>         EP004835_K10_MVAX EP004835_K20_MVAX EP004835_K30_MVAX
#> Eotaxin             68.05             60.08             37.84
#> FGF                265.45             -2.00             -2.00
#> G-CSF              624.79           2470.70             -2.00
#> GM-CSF              51.39             -2.00             42.01
#> IFN-g              164.58            193.22            184.53

The cytokine data has 29 rows and 1396 columns.

Load the cytokine sample annotation data as a matrix, rows are samples, columns are annotations:

data("momspiCyto_samp")
colnames(momspiCyto_samp)
#>  [1] "file_id"          "md5"              "size"            
#>  [4] "urls"             "sample_id"        "file_name"       
#>  [7] "subject_id"       "sample_body_site" "visit_number"    
#> [10] "subject_gender"   "subject_race"     "study_full_name" 
#> [13] "project_name"

The function momspiCytokines will make a SummarizedExperiment containing cytokine data

momspiCyto <- momspiCytokines()
momspiCyto
#> class: SummarizedExperiment 
#> dim: 29 1396 
#> metadata(0):
#> assays(1): ''
#> rownames(29): Eotaxin FGF ... FGF basic IL-17
#> rowData names(1): cytokine
#> colnames(1396): EP004835_K10_MVAX EP004835_K20_MVAX ...
#>   EP996091_K40_MVAX EP996091_K60_MVAX
#> colData names(13): file_id md5 ... study_full_name project_name

IBD

Load 16S data as a matrix, rows are SILVA IDs, columns are sample names:

data("IBD16S_mtx")

Load the SILVA taxonomy table as a matrix, rows are SILVA IDs, columns are taxonomic ranks:

data("IBD16S_tax")
colnames(IBD16S_tax)
#> [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("IBD16S_samp")
colnames(IBD16S_samp) %>% head()
#> [1] "Project"       "sample_id"     "subject_id"    "site_sub_coll"
#> [5] "data_type"     "week_num"

The IBD phyloseq object can be loaded as follows.

IBD <- IBD16S()
IBD
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 982 taxa and 178 samples ]
#> sample_data() Sample Data:       [ 178 samples by 490 sample variables ]
#> tax_table()   Taxonomy Table:    [ 982 taxa by 6 taxonomic ranks ]

T2D

Load 16S data as a matrix, rows are Greengene IDs, columns are sample names:

data("T2D16S_mtx")

Load the Greengenes taxonomy table as a matrix, rows are Greengene IDs, columns are taxonomic ranks:

data("T2D16S_tax")
colnames(T2D16S_tax)
#> [1] "Kingdom" "Phylum"  "Class"   "Order"   "Family"  "Genus"   "Species"

Load the 16S sample annotation data as a matrix, rows are samples, columns are annotations:

data("T2D16S_samp")
colnames(T2D16S_samp)
#>  [1] "file_id"          "md5"              "size"            
#>  [4] "urls"             "sample_id"        "file_name"       
#>  [7] "subject_id"       "sample_body_site" "visit_number"    
#> [10] "subject_gender"   "subject_race"     "study_full_name" 
#> [13] "project_name"

The T2D phyloseq object can be loaded like so.

T2D <- T2D16S()
T2D
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 12062 taxa and 2208 samples ]
#> sample_data() Sample Data:       [ 2208 samples by 13 sample variables ]
#> tax_table()   Taxonomy Table:    [ 12062 taxa by 7 taxonomic ranks ]

Features

Frequency Table Generation

The sample-level annotation data contianed in HMP2Data can be summarized using the table_two function. First, we need to make a list of the phyloseq and SummarizedExperiment objects which can then be entered into the table_two() table generating function.

list("MOMS-PI 16S" = momspi16S_phyloseq, "MOMS-PI Cytokines" = momspiCyto, "IBD 16S" = IBD, "T2D 16S" = T2D) %>% table_two()
MOMS-PI 16S
MOMS-PI Cytokines
IBD 16S
T2D 16S
N % N % N % N %
Body Site
buccal mucosa 3313 36.38 311 22.28 0 0 0 0
cervix of uterus 162 1.78 0 0 0 0 0 0
feces 765 8.4 0 0 178 100 1041 47.15
rectum 2679 29.42 0 0 0 0 0 0
unknown 146 1.6 0 0 0 0 0 0
vagina 2042 22.42 979 70.13 0 0 0 0
blood cell 0 0 106 7.59 0 0 0 0
nasal cavity 0 0 0 0 0 0 1167 52.85
Sex
male 0 0 0 0 84 47.19 1248 56.52
female 9107 100 1396 100 94 52.81 947 42.89
unknown 0 0 0 0 0 0 13 0.59
Race
african american 0 0 0 0 5 2.81 117 5.3
asian 0 0 0 0 0 0 235 10.64
caucasian 0 0 0 0 164 92.13 1657 75.05
ethnic other 0 0 0 0 8 4.49 73 3.31
hispanic or latino 0 0 0 0 0 0 126 5.71
american indian or alaska native 0 0 0 0 1 0.56 0 0
unknown 9107 100 1396 100 0 0 0 0
total samples 9107 100 1396 100 178 100 2208 100

Visits Histograms

We can plot the histogram of the number of samples at each visit for all studies. Note that the X-axis is capped at count of 40 for better visualization.

# make data.frame for plotting
plot_visits <- data.frame(study = c(rep("MOMS-PI Cytokines", nrow(momspiCyto_samp)),
                     rep("IBD 16S", nrow(IBD16S_samp)),
                     rep("T2D 16S", nrow(T2D16S_samp))),
          visits = c(momspiCyto_samp$visit_number,
                     IBD16S_samp$visit_number,
                     T2D16S_samp$visit_number))
p2 <- ggplot(plot_visits, aes(x = visits, fill = study)) + 
  geom_histogram(position = "dodge", alpha = 0.7, bins = 30, color = "#00BFC4") + xlim(c(0, 40)) +
  theme(legend.position = c(0.8, 0.8))  + 
  xlab("Visit number") + ylab("Count")
p2
#> Warning: Removed 677 rows containing non-finite values (stat_bin).
#> Warning: Removed 4 rows containing missing values (geom_bar).

Note that there are 677 samples at visit numbers over 40.

UpsetR plots

MOMS-PI 16S rRNA data

Here we plot the body sites which samples were taken from for each patient in the MOMS-PI 16S data.

# set up data.frame for UpSetR
momspi_upset <- aggregate(momspi16S_samp$sample_body_site, by = list(momspi16S_samp$subject_id), table)
tmp <- as.matrix(momspi_upset[, -1])
tmp <- (tmp > 0) *1
momspi_upset <- data.frame(patient = momspi_upset$Group.1, tmp)

# plot
upset(momspi_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

MOMS-PI Cytokines data

Here we plot the body sites which samples were taken from for each patient in the MOMS-PI cytokines data.

# set up data.frame for UpSetR
momspiCyto_upset <- aggregate(momspiCyto_samp$sample_body_site, by = list(momspiCyto_samp$subject_id), table)
tmp <- as.matrix(momspiCyto_upset[, -1])
tmp <- (tmp > 0) *1
momspiCyto_upset <- data.frame(patient = momspiCyto_upset$Group.1, tmp)

# plot
upset(momspiCyto_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

IBD data

The IBD patients only had samples taken from the feces.

T2D data

Here we plot the body sites which samples were taken from for each patient in the T2D 16S rRNA data.

# set up data.frame for UpSetR
T2D_upset <- aggregate(T2D16S_samp$sample_body_site, by = list(T2D16S_samp$subject_id), table)
tmp <- as.matrix(T2D_upset[, -1])
tmp <- (tmp > 0) *1
T2D_upset <- data.frame(patient = T2D_upset$Group.1, tmp)

# plot
upset(T2D_upset, order.by = 'freq', matrix.color = "blue", text.scale = 2)

Microbiome Analyses: IBD example


Table1Var <- c("subject_gender", "Age.at.diagnosis", "race", "visit_number",
               "site_name", "Education.Level", "Antibiotics", "BMI")

demo_data = sample_data(IBD)[ ,c("sample_id", "subject_id", Table1Var, "diagnosis")] %>%
            data.frame()

#all <- CreateTableOne(vars = Table1Var, data = demo_data)
#kableone(all)

stratified = tableone::CreateTableOne(
  vars = Table1Var,
  data = summarytools::unlabel(demo_data), strata = "diagnosis", includeNA = TRUE)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> Warning in ModuleReturnVarsExist(vars, data): These variables only have NA/
#> NaN: BMI Dropped
#> Warning in min(x, na.rm = TRUE): no non-missing arguments to min; returning
#> Inf
#> Warning in max(x, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf
#> Warning in StdDiff(variable = var, group = strataVar): Variable has only
#> NA's in at least one stratum. na.rm turned off.
stratified <- print(stratified, printToggle = FALSE, showAllLevels = FALSE)
stratified[,!(colnames(stratified) %in% "test")]%>%
  knitr::kable(format = "html", caption = "Characteristics of the Study Cohort",
               col.names = c("CD", "nonIBD", 
                             "UC", "p-value"))%>% 
  kable_styling("striped", full_width = T)
Table 1: Characteristics of the Study Cohort
CD nonIBD UC p-value
n 86 46 46
subject_gender = male (%) 43 ( 50.0) 24 ( 52.2) 17 ( 37.0) 0.264
Age.at.diagnosis (mean (SD)) 21.34 (11.31) NaN (NA) 25.65 (15.21) 0.067
race (%) 0.002
American Indian or Alaska Native 1 ( 1.2) 0 ( 0.0) 0 ( 0.0)
Black or African American 0 ( 0.0) 0 ( 0.0) 5 ( 10.9)
More than one race 0 ( 0.0) 2 ( 4.3) 2 ( 4.3)
Other 4 ( 4.7) 0 ( 0.0) 0 ( 0.0)
White 81 ( 94.2) 44 ( 95.7) 39 ( 84.8)
visit_number (mean (SD)) 5.33 (10.80) 2.35 (6.39) 1.67 (4.57) 0.034
site_name (%) <0.001
Cedars-Sinai 21 ( 24.4) 2 ( 4.3) 23 ( 50.0)
Cincinnati 28 ( 32.6) 20 ( 43.5) 11 ( 23.9)
MGH 24 ( 27.9) 18 ( 39.1) 6 ( 13.0)
MGH Pediatrics 13 ( 15.1) 6 ( 13.0) 6 ( 13.0)
Education.Level (%) 0.001
7th grade or less 14 ( 16.3) 18 ( 39.1) 7 ( 15.2)
Bachelor’s degree 9 ( 10.5) 10 ( 21.7) 4 ( 8.7)
High school graduate or GED 5 ( 5.8) 2 ( 4.3) 0 ( 0.0)
Master’s degree 0 ( 0.0) 2 ( 4.3) 4 ( 8.7)
Professional/Doctoral degree 6 ( 7.0) 2 ( 4.3) 4 ( 8.7)
Some college, no degree 12 ( 14.0) 0 ( 0.0) 7 ( 15.2)
Some high school 27 ( 31.4) 8 ( 17.4) 10 ( 21.7)
Unknown/Not Reported 11 ( 12.8) 2 ( 4.3) 10 ( 21.7)
NA 2 ( 2.3) 2 ( 4.3) 0 ( 0.0)
Antibiotics = No (%) 86 (100.0) 46 (100.0) 46 (100.0) NA

###Alpha diversity analysis

Alpha Diversity Definitions

Alpha diversity is a within-sample diversity measurement that models the richness and/or eveness of the microbial community. Different alpha diversities emphasize different aspect of the community. Commonly used alpha diversities include:

Diversity Description Formula
Observed Species No. of unique taxa in a sample Number of species
Chao 1 Adding a correction to observed species $S_{ob} + \frac{n_1^2}{2n_2}$
Shannon Both richness and evenness $-\sum_{i=1}^{s} p_i \log(p_i)$
Simpson reciprocal Both richness and evenness $\frac{N(N - 1)}{n(n-1)}$
Pielou’s evenness evenness $\frac{H}{H_{max}}$

Of course, the alpha diversity measurement is impacted by the sequencing depth (total number of reads per sample). Rarefying (through vegan:rarefy) is necessary before calculating alpha diversity.

Comparing Alpha Diversities between Groups

We can use “plot_richness” from phyloseq package to plot the alpha diversities.

IBD %<>%
  taxa_sums() %>%
  is_greater_than(0) %>%
  prune_taxa(IBD)
sample_data(IBD)$diagnosis = factor(sample_data(IBD)$diagnosis)

p = plot_richness(IBD, x="diagnosis", color="diagnosis", measures=c("Observed", "InvSimpson", "Shannon", "Chao1")) + geom_jitter()
p + geom_boxplot(data = p$data, aes(x = diagnosis, y = value, color = NULL), alpha = 0.1)
#> Warning: Removed 534 rows containing missing values (geom_errorbar).

Once we summarize the whole community through a one-dimensional summary statistics (alpha diversity), all the traditional methods for hypothesis testing are applicable.

alpha = estimate_richness(IBD, measures=c("Observed", "InvSimpson", "Shannon"))
alpha_df = data.frame(sample_data(IBD)[,c("sample_id","subject_id")], alpha)
rownames(alpha_df) = NULL
kable(head(alpha_df),
      digits = 3,caption = "Some Alpha Diversities of the IBD dataset") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F,position = "center") 
Table 2: Some Alpha Diversities of the IBD dataset
sample_id subject_id Observed Shannon InvSimpson
206534 M2008 112 2.572 5.590
206536 M2008 96 2.534 5.565
206538 M2008 80 2.655 6.452
206547 M2014 77 2.733 7.030
206548 M2014 69 2.632 6.556
206561 M2021 63 3.357 16.324

Beta diversity analysis

Beta Diversity Definitions

Beta diversity describes how samples vary against each other. Beta diversity is typically the thinking behind “clustering” algorithms that show differences or similarities among samples. For example, we may be intrested in the differences in gut microbiome between non IBD and IBD patients.

List of the commonly used beta-diversities:

Diversity Description Formula
UniFrac Qualitative, Phylogenetics-based $\sum_{i=1}^n\frac{b_i|I(p_i^A > 0)-I(p_i^B > 0)|}{\sum_{i=1}^n b_i}$
Weighted UniFrac Quantitative,Phylogenetics-based $\frac{\sum_{i=1}^n b_i|p_i^A - p_i^B|}{\sum_{i=1}^nb_i(p_i^A + p_i^B)}$
Generalized UniFrac Compromise between the previous two $\frac{\sum_{i=1}^n b_i (p_i^A + p_i^B)^{\alpha} |\frac{p_i^A - p_i^B}{p_i^A + p_i^B}|} {\sum_{i=1}^nb_i(p_i^A + p_i^B)^{\alpha}}$
Bray-Curtis Quantitative $1 - \frac{2C_{i,j}}{S_i + S_j}$
Jaccard Qualitative $\frac{|A \cap B|}{|A \cup B|}$

Comparing Beta Diversities between Groups


par(mfrow = c(1,2))
jac <- as.matrix(distance(t(otu_table(IBD)), method = "jaccard"))
mod1 <- vegan::betadisper(as.dist(jac), as.factor(IBD16S_samp$diagnosis))
plot(mod1, ellipse = TRUE, hull = FALSE, main="Jaccard Distance", xlab="PC 1", ylab="PC 2", sub=NULL)
box()

bc = as.matrix(ecodist::bcdist(t(otu_table(IBD))))
mod2 <- vegan::betadisper(as.dist(bc), as.factor(IBD16S_samp$diagnosis))
plot(mod2, ellipse = TRUE, hull = FALSE, main="Bray-Curtis Dissimilarity", xlab="PC 1", ylab="PC 2", sub=NULL, xlim = c(-0.2, 0.2))
legend("bottomright", legend=c("CD", "nonIBD", "UC"), pch = 1:3, col=c("Black", "red", "Green"), cex=1.3, box.lty=0)
box()

#####Permutational Multivariate Analysis of Variance

Permutational multivariate analysis of variance (PERMANOVA) (???), a non-parametric multivariate statistical test, is one commonly used for assessing the association between microbiome community and a phenotype (e.g., IBD or not).

However, the PERMANOVA approach, by itself, doesn’t allow for repeated measures. Ignoring the repeated measures (treating them as if they are all independent samples) can lead to inflated type I error (i.e., more false positive discoveries than we would like to accept through our p-value cutoff). However, we can use the method if we consider a subset of the samples which are independent with each other, such as those from the baseline.

PERMANOVA requires the selection of a single distance metric. Howeve, if the selected distance metric doesn’t capture the underlying association pattern, PERMANOVA tends to lose power.

IBD0 = subset_samples(IBD,  week_num == 0&diagnosis %in% c("nonIBD", "CD"))
IBD0 %<>%
  taxa_sums() %>%
  is_greater_than(0) %>%
  prune_taxa(IBD0)

jac0 <- as.matrix(distance(t(otu_table(IBD0)), method = "jaccard"))
bc0 =  as.matrix(ecodist::bcdist(t(otu_table(IBD0))))
CD = as.numeric(sample_data(IBD0)$diagnosis == "CD")
gender = as.numeric(factor(sample_data(IBD0)$subject_gender))

fit = vegan::adonis(bc0 ~ gender + CD, permutations=500)
kable(fit$aov.tab,
      digits = 3,caption = "PERMANOVA result using Bray-Curtis distance") %>% kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)  
Table 3: PERMANOVA result using Bray-Curtis distance
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
gender 1 0.421 0.421 1.288 0.017 0.178
CD 1 0.740 0.740 2.260 0.030 0.010
Residuals 71 23.237 0.327 NA 0.952 NA
Total 73 24.398 NA NA 1.000 NA
Microbiome Regression Kernel Association Test (MiRKAT)

MiRKAT: a regression approach for flexibly associating the microbial community with phenotype of interest (???). The method also has extensions that can handle longitudinal measures (???), multivariate (???) and survial outcomes (???).

g(yi) = Xiβ + f(Mi) + ϵi, f a reproducing kernel Hilbert space generated by function K. Equivalently, f ∼ N(0, τK), in which K is a similarity matrix that can be calculated through a distance metric.

Multiple kernels (distances) are allowed simulatenously in this framework, making this procedure robust to the underlying association patterns.

 K_bc = MiRKAT::D2K(bc0); K_jac = D2K(jac0)
 y = ifelse(sample_data(IBD0)$diagnosis == "CD",1, 0 )
 out = MiRKAT::MiRKAT(y=y, X = gender, Ks = list(K_bc, K_jac), out_type = "D") 

Multi-omics data integration: MOMS-PI example

Combine 16S and cytokines data

#order both sets by visit number within a subject
momspi16S_samp <- momspi16S_samp[
  with(momspi16S_samp, order(subject_id, sample_body_site, visit_number)),
] 

momspiCyto_samp <- momspiCyto_samp[
  with(momspiCyto_samp, order(subject_id, sample_body_site, visit_number)),
] 
#select data collected at the same visit
combined_samp <- merge(momspi16S_samp, momspiCyto_samp, 
                       by = c("subject_id", "sample_body_site", 
                        "project_name", "study_full_name",
                        "subject_gender", "subject_race",
                        "visit_number"))

In this tutorial, we concentrate on cross-sectional analysis of the first visit data, which corresponds to earlier time during pregnancy.

#select data from first visit only
combined_samp <- combined_samp[combined_samp$visit_number ==  1,]

table(combined_samp$sample_body_site)#all vaginal samples
#> 
#> vagina 
#>    225

We match the samples (contained in columns of both tables) by the file names contained in colnames of each table.

#> [1]  29 225

In ‘combined_samp’ object the names of matched files names for 16S data are recorded in column ‘file.x’ and for cytokines data in column ‘file.y’.

Make sure that samples are in rows and variables (taxa and cytokines) are in columns.

combined_16S_mtx <- t(combined_16S_mtx)
combined_Cyto_mtx <- t(combined_Cyto_mtx)

Taxa are converted to proportions.

combined_16S_mtx <- combined_16S_mtx/apply(combined_16S_mtx, 1, sum)
#> [1] TRUE

Co-inertia analysis

Basics:

  1. Let X and Y be 16S and cytokines tables respectively
  2. Rows: same n women at first visit
  3. Columns: p1 taxa and p2 cytokines
  4. Column weights: QX and QY
  5. Row weights: D
  6. PCA analysis of each table: (X, QX, D) and (Y, QY, D)
  7. Co-inertia axes: YTDX = KΛ1/2AT from eigendecomposition of (YTDX, QX, QY)
  8. Plot FX = XA and FY = YK

Taxa normalization:

  1. Center 16S data to work with PCA on the covariance matrix ΣX = Cov(X)
  2. To normalize the magnitude, divide each value of X by the total variance: $\sqrt{\mbox{tr}(\Sigma_X)}$
  3. Note: step 2 is quivalent to dividing the matrix by $\sqrt{\sum_{k=1}^r \lambda_k}$, where λk are the eigevalues of ΣX and r is the rank of X.
taxa_mtx <- scale(combined_16S_mtx, center = TRUE, scale = FALSE)
#use fast trace computation formula: tr(A^B) = sum(A*B), where '*' operator refers to elemetwise product
taxa_tr <- sum(taxa_mtx*taxa_mtx)/(dim(taxa_mtx)[1]-1)
taxa_mtx <- taxa_mtx/sqrt(taxa_tr)
taxa.pca <- ade4::dudi.pca(taxa_mtx, scannf=FALSE, nf =61,
                     center = FALSE, scale = FALSE)

Cytokines normalization:

  1. Center and scale cytokines data to work with PCA on the correlation matrix ΣX = Cor(Y)
  2. To normalize the magnitude, divide each value of Y by the total variance: $\sqrt{\mbox{tr}(\Sigma_Y)}$
cyto_mtx <- scale(combined_Cyto_mtx, center = TRUE, scale = TRUE)
cyto_tr <- sum(cyto_mtx*cyto_mtx)/(dim(cyto_mtx)[1]-1)
cyto_mtx <- cyto_mtx/sqrt(cyto_tr)
cyto.pca <- ade4::dudi.pca(cyto_mtx, scannf=FALSE, nf =61,
                     center = FALSE, scale = FALSE)

Combine the tables using co-inertia

Co-inertia is available through R package ade4. It takes ade4 PCA objects and performes joint eigendecomposition.

coin <- ade4::coinertia(taxa.pca, cyto.pca, scannf = FALSE, nf = 2)

RV coefficient – measure of similarity between 16S and cytokines tables

RV <- coin$RV
RV
#> [1] 0.0396224

Visualization of output

Interpretation of the plots is similar to examining PCA results.

Plot below provides variables projection on common co-inertia axes.

  1. Variables projected close to each other reflect similarilty within and across two data sets

  2. Variables with larger values on each component (x- and y-axes) have more importance

We start by identifying important taxa (larger component values):

  1. visually from the plot, or
taxa.inx <- c(6,  19,  24,  28,  31,  68,  69, 396)
  1. by specifying x and y coordinates to pull all variables with coordinates larger (or smaller) than these values. Table 1 (taxa) values can be accessed $co object of coinertia output named coin.
kable(head(coin$co),
      digits = 5)%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Comp1 Comp2
807795 0.00002 0.00002
134726 -0.00006 0.00003
215097 0.00000 0.00000
542066 -0.00001 -0.00004
851634 0.00001 0.00000
354905 0.01164 -0.00768

Similarly, table 2 (here cytokines) variables loadings can be extracted from the $li object of coinertia output named coin.

kable(head(coin$li),
      digits = 5)%>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
Axis1 Axis2
Eotaxin -0.01231 -0.00518
FGF -0.00549 -0.00471
G-CSF 0.00815 -0.01028
GM-CSF 0.01838 -0.00531
IFN-g -0.00426 -0.00404
IL-10 -0.00591 -0.00791
Table 4: Influential taxa
Greengene ID Plot ID Family Genus Species
134467 19 Lactobacillaceae Lactobacillus NA
137183 28 Bifidobacteriaceae Gardnerella NA
137580 24 Lactobacillaceae Lactobacillus NA
29566 69 Leptotrichiaceae Sneathia NA
309133 396 Enterococcaceae NA NA
332718 31 Lactobacillaceae Lactobacillus NA
354905 6 Lactobacillaceae Lactobacillus NA
851726 68 Veillonellaceae Megasphaera NA

Sample scores plots. Length of the arrows indicates the samples that have larger differences across two data sets.

#> Warning: Ignoring unknown aesthetics: na.rm

#> Warning: Ignoring unknown aesthetics: na.rm

Samples with largest difference across two data sets. Samples with arrow lengths in 0.9 quantile are chosen.

#Taxa with major differences across two sets
large.difs <- rownames(Samp.coin$Dissimilarity[Samp.coin$Dissimilarity$Quantile >= 0.9, ])

Samples with largest differences across two data sets.

small.difs <- rownames(Samp.coin$Dissimilarity[Samp.coin$Dissimilarity$Quantile < 0.1, ])

Examine these taxa and cytokines

Morgan, Xochitl C, and Curtis Huttenhower. 2012. “Chapter 12: Human Microbiome Analysis.” PLoS Computational Biology 8 (12): e1002808. https://doi.org/10.1371/journal.pcbi.1002808.

Pasolli, Edoardo, Lucas Schiffer, Paolo Manghi, Audrey Renson, Valerie Obenchain, Duy Tin Truong, Francesco Beghini, et al. 2017. “Accessible, Curated Metagenomic Data Through ExperimentHub.” Nature Methods 14 (11). Nature Research: 1023–4. https://doi.org/10.1038/nmeth.4468.

Schiffer, Lucas, Rimsha Azhar, Lori Shepherd, Marcel Ramos, Ludwig Geistlinger, Curtis Huttenhower, Jennifer B Dowd, Nicola Segata, and Levi Waldron. 2019. “HMP16SData: Efficient Access to the Human Microbiome Project Through Bioconductor.” American Journal of Epidemiology, January. https://doi.org/10.1093/aje/kwz006.