RNA-seq workflow: gene-level exploratory analysis and differential expression

CSAMA 2024 version

Michael I. Love1,2, Charlotte Soneson3,4, Simon Anders5, Vladislav Kim6, Johannes Rainer7 and Wolfgang Huber6

1Department of Biostatistics, UNC-Chapel Hill, Chapel Hill, NC, US
2Department of Genetics, UNC-Chapel Hill, Chapel Hill, NC, US
3Friedrich Miescher Institute for Biomedical Research, Switzerland
4SIB Swiss Institute of Bioinformatics, Switzerland
5Centre for Molecular Biology, Univ. Heidelberg (ZMBH), Germany
6European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
7Institute for Biomedicine, Eurac Research, Bolzano, Italy

25 June 2024

1 Introduction

In this tutorial we walk through a gene-level RNA-seq differential expression analysis using Bioconductor packages. We start from the gene-vs-sample count matrix, and thus assume that the raw reads have already been quality controlled and that the gene expression has been quantified (either using alignment and counting, or by applying an alignment-free quantification tool). We perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, then perform differential gene expression analysis, and visually explore the results.

Bioconductor (Huber et al. 2015) has many packages supporting analysis of high-throughput sequence data, including RNA-seq. The packages that we will use in this tutorial include core packages maintained by the Bioconductor core team for importing and processing raw sequencing data and loading gene annotations. We will also use contributed packages for statistical analysis and visualization of sequencing data. Through scheduled releases every 6 months, the Bioconductor project ensures that all the packages within a release will work together in harmony (hence the “conductor” metaphor). The packages used in this tutorial are loaded with the library function and can be installed by following the Bioconductor package installation instructions. If you use the results from an R package in published research, you can find the proper citation for the software by typing citation("pkgName"), where you would substitute the name of the package for pkgName. Citing methods papers helps to support and reward the individuals who put time into open source software for genomic data analysis.

Many parts of this tutorial are based on a published RNA-seq workflow available via F1000Research (Love et al. 2015) and as a Bioconductor package.

1.1 Experimental data

The data used in this workflow is stored in an R package, airway2, containing quantification data for eight RNA-seq samples. The quantification files summarize an RNA-seq experiment wherein airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. For more description of the experiment see the PubMed entry 24926665 and for raw data see the GEO entry GSE52778.

The airway2 package contains output from Salmon (Patro et al. 2017), as well as a metadata file with sample annotations. More information about how the raw data was processed is available in this script.

We start by setting the path to the folder containing the quantifications (the output folders from Salmon). Since these are provided with an R package, we will point to the extdata subfolder of the installed package. For a typical analysis of your own data, you would point directly to a folder on your storage system (i.e., not using system.file()).

library(airway2)
dir <- system.file("extdata", package = "airway2")
list.files(dir)

## [1] "multiqc_config.yaml" "multiqc_report.html" "quants"              "Snakefile"          
## [5] "SraRunTable.txt"

The quantification directories, one for each sample, are in the quants folder.

list.files(file.path(dir, "quants"))

## [1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516" "SRR1039517" "SRR1039520"
## [8] "SRR1039521"

For more information about the output files generated by Salmon, see the documentation. The sample identifiers used here are the SRR identifiers from the Sequence Read Archive. For this experiment, we downloaded a table that links the SRR identifiers to the sample information about each experiment. From this Run Selector view of the experiment, we clicked the button: RunInfo Table. This downloads a file called SraRunTable.txt.

2 Reading the metadata

Now, we will read the metadata file that was just mentioned. We will only retain the annotations that we need for the rest of the analysis, but as you can see there is much more information recorded for these samples. The main annotations of interest for this tutorial are treatment, which represents the treatment of the sample (Untreated or Dexamethasone) and cell_line, which represents the cell line. The sample identifier is given by the Run column.

coldata <- read.delim(file.path(dir, "SraRunTable.txt"))
colnames(coldata)

##  [1] "AvgSpotLen"         "BioSample"          "Experiment"         "MBases"            
##  [5] "MBytes"             "Run"                "SRA_Sample"         "Sample_Name"       
##  [9] "cell_line"          "ercc_mix"           "treatment"          "Assay_Type"        
## [13] "BioProject"         "Center_Name"        "Consent"            "DATASTORE_filetype"
## [17] "DATASTORE_provider" "InsertSize"         "Instrument"         "LibraryLayout"     
## [21] "LibrarySelection"   "LibrarySource"      "LoadDate"           "Organism"          
## [25] "Platform"           "ReleaseDate"        "SRA_Study"          "cell_type"         
## [29] "source_name"        "tissue"

coldata <- coldata[, c("Run", "cell_line", "treatment")]
coldata

##          Run cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone
## 7 SRR1039520   N061011     Untreated
## 8 SRR1039521   N061011 Dexamethasone

The tximeta package, which we will use to import the quantifications, requires that there is a column called names, representing the sample names. Hence, we first rename the Run column.

colnames(coldata)[colnames(coldata) == "Run"] <- "names"
head(coldata)

##        names cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone

In addition to the names column, tximeta requires that coldata has a column named files, pointing to the Salmon output (the quant.sf file) for the respective samples.

coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf")
head(coldata)

##        names cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone
##                                                                                                             files
## 1 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039508/quant.sf
## 2 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039509/quant.sf
## 3 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039512/quant.sf
## 4 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039513/quant.sf
## 5 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039516/quant.sf
## 6 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039517/quant.sf

all(file.exists(coldata$files))

## [1] TRUE

Now we have everything we need, and can import the quantifications into R for further analysis.

3 Importing quantifications into R

We will next read the Salmon quantifications provided in the airway2 package into R and summarize the expected counts on the gene level. A simple way to import results from a variety of transcript abundance estimation tools into R is provided by the tximport and tximeta packages. Here, tximport reads the quantifications into a list of matrices, while tximeta instead aggregates the information into a SummarizedExperiment object, and also automatically adds additional annotations for the features. Both packages can return quantifications on the transcript level or aggregate them on the gene level. In the latter case, they also calculate average transcript lengths for each gene and each sample, which can be used as offsets to account for differential isoform usage across samples in the differential gene expression analysis (Soneson, Love, and Robinson 2015).

For Salmon, the quantifications for a given sample are stored in the quant.sf file, which is included in the output directory.

quants <- read.delim(file.path(dir, "quants", coldata$names[1], "quant.sf"))
head(quants)

##                Name Length EffectiveLength      TPM NumReads
## 1 ENST00000456328.2   1657        1561.450 0.793609   16.907
## 2 ENST00000450305.2    632         477.000 0.000000    0.000
## 3 ENST00000488147.1   1351        1185.158 4.288707   69.349
## 4 ENST00000619216.1     68           3.000 0.000000    0.000
## 5 ENST00000473358.1    712         557.000 0.000000    0.000
## 6 ENST00000469289.1    535         380.000 0.000000    0.000

Name is the transcript identifier.
Length is the annotated transcript length.
EffectiveLength is the length after accounting for the fact that a read is not equally likely to start anywhere in the transcript.
TPM is the estimated abundance in transcript-per-million.
NumReads is the estimated abundance in read counts. Note that these are typically not integers; however, they are still on the same “scale” as integer read counts returned by more traditional counting approaches, and can be used in their place for downstream analysis.

The code below imports the Salmon quantifications into R using the tximeta package. Note how the transcriptome that was used for the quantification is automatically recognized and used to annotate the resulting data object. In order for this to work, tximeta requires that the output folder structure from Salmon is retained, since it reads information from the associated log files in addition to the quantified abundances themselves.

suppressPackageStartupMessages({
    library(tximeta)
    library(DESeq2)
    library(org.Hs.eg.db)
    library(SummarizedExperiment)
})

## Import quantifications on the transcript level
st <- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE)

## importing quantifications

## reading in files with read_tsv

## 1 2 3 4 5 6 7 8 
## found matching linked transcriptome:
## [ GENCODE - Homo sapiens - release 29 ]
## loading existing TxDb created: 2022-02-09 23:19:12
## loading existing transcript ranges created: 2022-02-09 23:19:13
## fetching genome info for GENCODE

st

## class: RangedSummarizedExperiment 
## dim: 205870 8 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(205870): ENST00000456328.2 ENST00000450305.2 ... ENST00000387460.2
##   ENST00000387461.2
## rowData names(3): tx_id gene_id tx_name
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell_line treatment

Question: How can you infer that the SummarizedExperiment object contains abundances for individual transcripts, rather than genes?

We see that tximeta has identified the transcriptome used for the quantification as GENCODE - Homo sapiens - release 29. How did this happen? In fact, the output directory from Salmon contains much more information than just the quant.sf file! (as mentioned above, this means that it is not advisable to move files out of the folder, or to share only the quant.sf file, since the context is lost):

list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)

##  [1] "aux_info/ambig_info.tsv"          "aux_info/bootstrap/bootstraps.gz"
##  [3] "aux_info/bootstrap/names.tsv.gz"  "aux_info/exp_gc.gz"              
##  [5] "aux_info/expected_bias.gz"        "aux_info/fld.gz"                 
##  [7] "aux_info/meta_info.json"          "aux_info/obs_gc.gz"              
##  [9] "aux_info/observed_bias_3p.gz"     "aux_info/observed_bias.gz"       
## [11] "cmd_info.json"                    "lib_format_counts.json"          
## [13] "libParams/flenDist.txt"           "logs/salmon_quant.log"           
## [15] "quant.sf"

In particular, the meta_info.json file contains a hash checksum, which is derived from the set of transcripts used as reference during the quantification and which lets tximeta identify the reference source (by comparing to a table of these hash checksums for commonly used references).

rjson::fromJSON(file = file.path(dir, "quants", coldata$names[1], 
                                 "aux_info", "meta_info.json"))

## $salmon_version
## [1] "0.14.1"
## 
## $samp_type
## [1] "gibbs"
## 
## $opt_type
## [1] "vb"
## 
## $quant_errors
## list()
## 
## $num_libraries
## [1] 1
## 
## $library_types
## [1] "IU"
## 
## $frag_dist_length
## [1] 1001
## 
## $seq_bias_correct
## [1] FALSE
## 
## $gc_bias_correct
## [1] TRUE
## 
## $num_bias_bins
## [1] 4096
## 
## $mapping_type
## [1] "mapping"
## 
## $num_valid_targets
## [1] 205870
## 
## $num_decoy_targets
## [1] 0
## 
## $num_eq_classes
## [1] 516269
## 
## $serialized_eq_classes
## [1] FALSE
## 
## $eq_class_properties
## [1] "range_factorized"
## 
## $length_classes
## [1]    520    669   1065   2328 205012
## 
## $index_seq_hash
## [1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"
## 
## $index_name_hash
## [1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"
## 
## $index_seq_hash512
## [1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"
## 
## $index_name_hash512
## [1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"
## 
## $num_bootstraps
## [1] 20
## 
## $num_processed
## [1] 22935521
## 
## $num_mapped
## [1] 21100805
## 
## $num_decoy_fragments
## [1] 0
## 
## $num_dovetail_fragments
## [1] 108510
## 
## $num_fragments_filtered_vm
## [1] 285111
## 
## $num_alignments_below_threshold_for_mapped_fragments_vm
## [1] 874275
## 
## $percent_mapped
## [1] 92.00055
## 
## $call
## [1] "quant"
## 
## $start_time
## [1] "Tue Jul 30 17:57:25 2019"
## 
## $end_time
## [1] "Tue Jul 30 18:07:04 2019"

Question: Can you find a list of annotations recognized/supported by tximeta? Hint: Check the tximeta vignette. What can you do if you are working with e.g. a custom annotation?

The assays in the SummarizedExperiment object are created by directly importing the values from the quant.sf files and combining this information across the eight samples:

counts - NumReads column
abundance - TPM column
length - EffectiveLength column

We can access any of the assays via the assay function:

head(assay(st, "counts"), 3)

##                   SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## ENST00000456328.2     16.907     16.673     10.237       0.00     18.719      0.000      0.000
## ENST00000450305.2      0.000      0.000      0.000       0.00      0.000      0.000      0.000
## ENST00000488147.1     69.349     77.948    199.467      45.81     97.035     95.227     56.568
##                   SRR1039521
## ENST00000456328.2      0.000
## ENST00000450305.2      0.000
## ENST00000488147.1     66.213

Question: Convince yourself that the import worked as described above by comparing some values to those present in the quant.sf text files output from Salmon.

Question: What do the column sums of the counts assay correspond to? What about the column sums of the abundance assay?

You may have noted that st is in fact a RangedSummarizedExperiment object (rather than “just” a SummarizedExperiment object). What does this mean? Let’s look at the information we have about the rows (transcripts) in the object:

rowRanges(st)

## GRanges object with 205870 ranges and 3 metadata columns:
##                     seqnames      ranges strand |     tx_id           gene_id           tx_name
##                        <Rle>   <IRanges>  <Rle> | <integer>   <CharacterList>       <character>
##   ENST00000456328.2     chr1 11869-14409      + |         1 ENSG00000223972.5 ENST00000456328.2
##   ENST00000450305.2     chr1 12010-13670      + |         2 ENSG00000223972.5 ENST00000450305.2
##   ENST00000488147.1     chr1 14404-29570      - |      9483 ENSG00000227232.5 ENST00000488147.1
##   ENST00000619216.1     chr1 17369-17436      - |      9484 ENSG00000278267.1 ENST00000619216.1
##   ENST00000473358.1     chr1 29554-31097      + |         3 ENSG00000243485.5 ENST00000473358.1
##                 ...      ...         ...    ... .       ...               ...               ...
##   ENST00000361681.2     chrM 14149-14673      - |    206692 ENSG00000198695.2 ENST00000361681.2
##   ENST00000387459.1     chrM 14674-14742      - |    206693 ENSG00000210194.1 ENST00000387459.1
##   ENST00000361789.2     chrM 14747-15887      + |    206684 ENSG00000198727.2 ENST00000361789.2
##   ENST00000387460.2     chrM 15888-15953      + |    206685 ENSG00000210195.2 ENST00000387460.2
##   ENST00000387461.2     chrM 15956-16023      - |    206694 ENSG00000210196.2 ENST00000387461.2
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

By knowing the source and version of the reference used for the quantification, tximeta was able to retrieve the annotation files and decorate the object with information about the transcripts, such as the chromosome and position, and the corresponding gene ID. Importantly, Salmon did not use (or know about) any of this during the quantification! It needs only the transcript sequences. If we just want the annotation columns, without the ranges, we can get those with the rowData accessor:

rowData(st)

## DataFrame with 205870 rows and 3 columns
##                       tx_id           gene_id           tx_name
##                   <integer>   <CharacterList>       <character>
## ENST00000456328.2         1 ENSG00000223972.5 ENST00000456328.2
## ENST00000450305.2         2 ENSG00000223972.5 ENST00000450305.2
## ENST00000488147.1      9483 ENSG00000227232.5 ENST00000488147.1
## ENST00000619216.1      9484 ENSG00000278267.1 ENST00000619216.1
## ENST00000473358.1         3 ENSG00000243485.5 ENST00000473358.1
## ...                     ...               ...               ...
## ENST00000361681.2    206692 ENSG00000198695.2 ENST00000361681.2
## ENST00000387459.1    206693 ENSG00000210194.1 ENST00000387459.1
## ENST00000361789.2    206684 ENSG00000198727.2 ENST00000361789.2
## ENST00000387460.2    206685 ENSG00000210195.2 ENST00000387460.2
## ENST00000387461.2    206694 ENSG00000210196.2 ENST00000387461.2

Similar to the row annotations in rowData, the SummarizedExperiment object contains sample annotations in the colData slot.

colData(st)

## DataFrame with 8 rows and 3 columns
##                  names   cell_line     treatment
##            <character> <character>   <character>
## SRR1039508  SRR1039508      N61311     Untreated
## SRR1039509  SRR1039509      N61311 Dexamethasone
## SRR1039512  SRR1039512     N052611     Untreated
## SRR1039513  SRR1039513     N052611 Dexamethasone
## SRR1039516  SRR1039516     N080611     Untreated
## SRR1039517  SRR1039517     N080611 Dexamethasone
## SRR1039520  SRR1039520     N061011     Untreated
## SRR1039521  SRR1039521     N061011 Dexamethasone

4 Summarizing on the gene level

As we saw, the features in the SummarizedExperiment object above are individual transcripts, rather than genes. Often, however, we want to perform the analysis on the gene level, since the gene-level abundances are more robust and sometimes more interpretable than transcript-level abundances. The rowData contains the information about the corresponding gene for each transcript, in the gene_id column, and tximeta provides a function to summarize on the gene level:

Counts are added up
TPMs are added up
Transcript lengths are added up after weighting by the respective transcript TPMs
The ranges are chosen based on the most abundant isoform (new since 2023)

## Summarize quantifications on the gene level
sg <- tximeta::summarizeToGene(st, assignRanges="abundant")

## loading existing TxDb created: 2022-02-09 23:19:12

## obtaining transcript-to-gene mapping from database

## loading existing gene ranges created: 2022-02-09 23:20:02

## gene ranges assigned by isoform abundance, see `assignRanges`

## summarizing abundance

## summarizing counts

## summarizing length

sg

## class: RangedSummarizedExperiment 
## dim: 58294 8 
## metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ... ENSG00000285993.1
##   ENSG00000285994.1
## rowData names(5): gene_id tx_ids isoform max_prop iso_prop
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell_line treatment

Now we have a new RangedSummarizedExperiment object, with one row per gene. The row ranges have been summarized as well, and can be used for subsetting and interpretation just as for the transcripts.

At this point, the only information we have about the genes in our data set, apart from their genomic location and the associated transcript IDs, is the Ensembl ID. Often we need additional annotations, such as gene symbols. Bioconductor provides a range of annotation packages:

OrgDb packages, providing gene-based annotations for a given organism
TxDb and EnsDb packages, providing transcript ranges for a given genome build
BSgenome packages, providing the genome sequence for a given genome build

For our purposes here, the appropriate OrgDb package is the most suitable, since it contains gene-centric ID conversion tables. Since this is human data, we will use the org.Hs.eg.db package.

## Add gene symbols
## Works because we know the provenance of the identifiers
metadata(sg)$txomeInfo$source # GENCODE

## [1] "GENCODE"

sg <- tximeta::addIds(sg, "SYMBOL", gene = TRUE)

## mapping to new IDs using org.Hs.eg.db
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs

## 'select()' returned 1:many mapping between keys and columns

sg

## class: RangedSummarizedExperiment 
## dim: 58294 8 
## metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ... ENSG00000285993.1
##   ENSG00000285994.1
## rowData names(6): gene_id tx_ids ... iso_prop SYMBOL
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell_line treatment

head(rowData(sg))

## DataFrame with 6 rows and 6 columns
##                               gene_id                                                     tx_ids
##                           <character>                                            <CharacterList>
## ENSG00000000003.14 ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
## ENSG00000000005.5   ENSG00000000005.5                        ENST00000373031.4,ENST00000485971.1
## ENSG00000000419.12 ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
## ENSG00000000457.13 ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
## ENSG00000000460.16 ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
## ENSG00000000938.12 ENSG00000000938.12  ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
##                               isoform  max_prop                             iso_prop      SYMBOL
##                           <character> <numeric>                        <NumericList> <character>
## ENSG00000000003.14  ENST00000373020.8  0.914071    0.0474288,0.9140710,0.0000000,...      TSPAN6
## ENSG00000000005.5                  NA       NaN                              NaN,NaN        TNMD
## ENSG00000000419.12  ENST00000371588.9  0.824899    0.8248988,0.0224922,0.0933611,...        DPM1
## ENSG00000000457.13 ENST00000367771.10  0.305852       0.305852,0.271815,0.237448,...       SCYL3
## ENSG00000000460.16  ENST00000413811.3  0.300272 0.08591688,0.00941284,0.15431423,...       FIRRM
## ENSG00000000938.12  ENST00000374003.7  0.500000                            0,0,0,...         FGR

To see a list of the possible columns, use the columns function from the AnnotationDbi package:

AnnotationDbi::columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
##  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GENETYPE"     "GO"          
## [13] "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
## [19] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"      
## [25] "UCSCKG"       "UNIPROT"

We can even add annotations where we expect (and would like to retain) multiple mapping values, e.g., associated GO terms:

sg <- addIds(sg, "GO", multiVals = "list", gene = TRUE)

## mapping to new IDs using org.Hs.eg.db
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs

## 'select()' returned 1:many mapping between keys and columns

head(rowData(sg))

## DataFrame with 6 rows and 7 columns
##                               gene_id                                                     tx_ids
##                           <character>                                            <CharacterList>
## ENSG00000000003.14 ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
## ENSG00000000005.5   ENSG00000000005.5                        ENST00000373031.4,ENST00000485971.1
## ENSG00000000419.12 ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
## ENSG00000000457.13 ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
## ENSG00000000460.16 ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
## ENSG00000000938.12 ENSG00000000938.12  ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
##                               isoform  max_prop                             iso_prop      SYMBOL
##                           <character> <numeric>                        <NumericList> <character>
## ENSG00000000003.14  ENST00000373020.8  0.914071    0.0474288,0.9140710,0.0000000,...      TSPAN6
## ENSG00000000005.5                  NA       NaN                              NaN,NaN        TNMD
## ENSG00000000419.12  ENST00000371588.9  0.824899    0.8248988,0.0224922,0.0933611,...        DPM1
## ENSG00000000457.13 ENST00000367771.10  0.305852       0.305852,0.271815,0.237448,...       SCYL3
## ENSG00000000460.16  ENST00000413811.3  0.300272 0.08591688,0.00941284,0.15431423,...       FIRRM
## ENSG00000000938.12  ENST00000374003.7  0.500000                            0,0,0,...         FGR
##                                                      GO
##                                                  <list>
## ENSG00000000003.14 GO:0005515,GO:0016020,GO:0039532,...
## ENSG00000000005.5  GO:0001886,GO:0001937,GO:0005515,...
## ENSG00000000419.12 GO:0004169,GO:0004582,GO:0004582,...
## ENSG00000000457.13 GO:0000139,GO:0005515,GO:0005524,...
## ENSG00000000460.16 GO:0000278,GO:0000775,GO:0000776,...
## ENSG00000000938.12 GO:0001784,GO:0001784,GO:0001819,...

Note that Salmon returns estimated or expected counts, which are not necessarily integers. They may need to be rounded before they are passed to count-based statistical methods (however, DESeq2 will automatically round the counts, and some other methods, like edgeR, will work also with the non-integer values).

5 Representing counts for differential expression packages

At this point, we have a gene-level count matrix, contained in our SummarizedExperiment object. This is a branching point where we could use a variety of Bioconductor packages for exploration and differential expression of the count matrix, including edgeR (Robinson, McCarthy, and Smyth 2009), DESeq2 (Love, Huber, and Anders 2014), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and BaySeq (Hardcastle and Kelly 2010). We will continue using mainly DESeq2.

Bioconductor software packages often define and use a custom class for storing data that makes sure that all the needed data slots are consistently provided and fulfill any requirements. In addition, Bioconductor has general data classes (such as the SummarizedExperiment) that can be used to move data between packages. The DEFormats package can be useful for converting between different classes. The core Bioconductor classes also provide useful functionality: for example, subsetting or reordering the rows or columns of a SummarizedExperiment automatically subsets or reorders the associated rowRanges and colData, which can help to prevent accidental sample swaps that would otherwise lead to spurious results. With SummarizedExperiment this is all taken care of behind the scenes.

Each of the packages typically used for differential expression has a specific class of object used to store the summarization of the RNA-seq experiment and the intermediate quantities that are calculated during the statistical analysis of the data. DESeq2 uses a DESeqDataSet and edgeR uses a DGEList.

5.1 The DESeqDataSet, sample information, and the design formula

In DESeq2, the custom class is called DESeqDataSet. It is built on top of the SummarizedExperiment class, and it is easy to convert SummarizedExperiment objects into DESeqDataSet objects. One of the two main differences compared to a SummarizedExperiment object is that the assay slot named “counts” can be accessed using the counts accessor function, and the DESeqDataSet class enforces that the values in this matrix are non-negative integers.

A second difference is that the DESeqDataSet has an associated design formula. The experimental design is specified at the beginning of the analysis, as it will inform many of the DESeq2 functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on the design formula). The design formula tells which columns in the sample information table (colData) specify the experimental design and how these factors should be used in the analysis.

Let’s remind ourselves of the design of our experiment:

colData(sg)

## DataFrame with 8 rows and 3 columns
##                  names   cell_line     treatment
##            <character> <character>   <character>
## SRR1039508  SRR1039508      N61311     Untreated
## SRR1039509  SRR1039509      N61311 Dexamethasone
## SRR1039512  SRR1039512     N052611     Untreated
## SRR1039513  SRR1039513     N052611 Dexamethasone
## SRR1039516  SRR1039516     N080611     Untreated
## SRR1039517  SRR1039517     N080611 Dexamethasone
## SRR1039520  SRR1039520     N061011     Untreated
## SRR1039521  SRR1039521     N061011 Dexamethasone

We have samples from two different conditions, and four cell lines:

table(colData(sg)$treatment)

## 
## Dexamethasone     Untreated 
##             4             4

table(colData(sg)$cell_line)

## 
## N052611 N061011 N080611  N61311 
##       2       2       2       2

We want to find the changes in gene expression that can be associated with the different treatments, but we also want to control for differences between the cell lines. The design which accomplishes this is obtained by writing ~ cell_line + treatment. By including cell_line, terms will be added to the model which account for differences between cell lines, and by adding treatment we get a term representing the treatment effect.

Note: it will be helpful for us if the first level of a factor is the reference level (e.g. control, or untreated samples). The reason is that by specifying this, functions further in the pipeline can be used and will give comparisons such as ‘treatment vs control,’ without needing to specify additional arguments.

We can relevel the treatment factor like so:

colData(sg)$treatment <- factor(colData(sg)$treatment)
colData(sg)$treatment <- relevel(colData(sg)$treatment, ref = "Untreated")
colData(sg)$treatment

## [1] Untreated     Dexamethasone Untreated     Dexamethasone Untreated     Dexamethasone
## [7] Untreated     Dexamethasone
## Levels: Untreated Dexamethasone

Also converting cell_line into a factor:

colData(sg)$cell_line <- factor(colData(sg)$cell_line)

We can use R’s formula notation to express any fixed-effects experimental design for edgeR or DESeq2. Note that these packages use the same formula notation as, for instance, the lm function of base R.

Using the ExploreModelMatrix R/Bioconductor package, we can represent our design in a graphical way. The first plot below visualizes the fitted values from the model for each combination of predictor variable levels, in terms of the estimated model coefficients. The second plot shows the number of observations per combination of levels in the predictor variables.

library(ExploreModelMatrix)
vd <- VisualizeDesign(sampleData = colData(sg), 
                      designFormula = ~ cell_line + treatment)
vd$plotlist

## [[1]]

vd$cooccurrenceplots

## [[1]]

We can also open the interactive interface to explore our design further:

ExploreModelMatrix(sampleData = colData(sg), 
                   designFormula = ~ cell_line + treatment)

To generate a DESeqDataSet object from a SummarizedExperiment object, we only need to additionally provide the experimental design in terms of a formula.

dds <- DESeqDataSet(sg, design = ~ cell_line + treatment)

## using counts and average transcript lengths from tximeta

We can also create a DESeqDataSet directly from a count matrix, a data frame with sample information and a design formula (see the DESeqDataSetFromMatrix function).

6 Filtering

It is often helpful to filter out lowly expressed genes before continuing with the analysis, to remove features that have nearly no information, increase the speed of the analysis and reduce the size of the data. At the very least we exclude genes with zero counts across all samples.

nrow(dds)

## [1] 58294

table(rowSums(assay(dds, "counts")) == 0)

## 
## FALSE  TRUE 
## 35573 22721

Here, we additionally filter to only keep genes with at least 10 counts for at least 4 samples. This again will help to reduce computation, and also makes some of the plots easier to read.

keep <- rowSums(counts(dds) >= 10) >= 4
table(keep)

## keep
## FALSE  TRUE 
## 41657 16637

dds <- dds[keep, ]
dim(dds)

## [1] 16637     8

Importantly, the group information should not be used to define the filtering criterion, since that can interfere with the validity of the multiple testing adjustment (FDR control) downstream.

7 Exploratory analysis and visualization

There are two separate analysis paths in this tutorial:

visual exploration of sample relationships, in which we will discuss transformation of the counts for computing distances or making plots
statistical testing for differences attributable to treatment, controlling for cell line effects

Importantly, the statistical testing methods rely on original count data (not scaled or transformed) for calculating the precision of measurements. However, for visualization and exploratory analysis, transformed counts are typically more suitable. Thus, it is critical to separate the two workflows and use the appropriate input data for each of them.

7.1 Transformations

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic. For RNA-seq raw counts, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of size-factor-normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with the very lowest counts will tend to dominate the results because, due to the strong Poisson noise inherent to small count values, and the fact that the logarithm amplifies differences for the smallest values, these low count genes will show the strongest relative differences between samples.

We can quickly show this property of counts with some simulated data (here, Poisson counts with a range of lambda from 0.1 to 100). We plot the standard deviation of each row (genes) against the mean:

lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library(vsn)
meanSdPlot(cts, ranks = FALSE)

And for logarithm-transformed counts after adding a pseudocount of 1:

log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)

The logarithm with a small pseudocount amplifies differences when the values are close to 0. The low count genes with low signal-to-noise ratio will overly contribute to sample-sample distances and PCA plots.

As a solution, DESeq2 offers transformations for count data that stabilize the variance across the mean: the regularized logarithm (rlog) and the variance stabilizing transformation (VST). These have slightly different implementations, discussed a bit in the DESeq2 paper and in the vignette, but a similar goal of stabilizing the variance across the range of values. For genes with high counts, the VST and rlog will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. The VST or rlog-transformed data then becomes approximately homoskedastic, and can be used directly for computing distances between samples, making PCA plots, or as input to downstream methods which perform best with homoskedastic data. Here we will use the variance stabilizing transformation implemented with the vst function.

vsd <- DESeq2::vst(dds, blind = FALSE)

## using 'avgTxLength' from assays(dds), correcting for library size

This returns a DESeqTransform object…

class(vsd)

## [1] "DESeqTransform"
## attr(,"package")
## [1] "DESeq2"

…which retains all the column metadata that was attached to the DESeqDataSet:

head(colData(vsd), 3)

## DataFrame with 3 rows and 3 columns
##                  names cell_line     treatment
##            <character>  <factor>      <factor>
## SRR1039508  SRR1039508   N61311  Untreated    
## SRR1039509  SRR1039509   N61311  Dexamethasone
## SRR1039512  SRR1039512   N052611 Untreated

We can also see that the variance has been stabilized:

meanSdPlot(assay(vsd), ranks = FALSE)

In the above function calls, we specified blind = FALSE, which means that differences between cell lines and treatment (the variables in the design) will not contribute to the expected variance-mean trend of the experiment. The experimental design is not used directly in the transformation, only in estimating the global amount of variability in the counts. For a fully unsupervised transformation, one can set blind = TRUE (which is the default).

7.2 PCA plot

One way to visualize sample-to-sample distances is a principal components analysis (PCA). In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences (figure below). The x-axis (the first principal component, or PC1) is the direction that separates the data points the most (i.e., the direction with the largest variance). The y-axis (the second principal component, or PC2) represents the direction with largest variance subject to the constraint that it must be orthogonal to the first direction. The percent of the total variance that is contained in the direction is printed in the axis label. Note that these percentages do not sum to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).

DESeq2::plotPCA(vsd, intgroup = "treatment")

## using ntop=500 top features by variance

From the PCA plot, we see that the differences between cells (the different plotting shapes) are considerable, though not stronger than the differences due to treatment with dexamethasone (red vs blue color). This shows why it will be important to account for this in differential testing by using a paired design (“paired,” because each dexamethasone treated sample is paired with one untreated sample from the same cell line). We are already set up for this design by assigning the formula ~ cell_line + treatment earlier.

8 Differential expression analysis

8.1 Performing differential expression testing with DESeq2

As we have already specified an experimental design when we created the DESeqDataSet, we can run the differential expression pipeline on the raw counts with a single call to the function DESeq.

dds <- DESeq2::DESeq(dds)

## estimating size factors

## using 'avgTxLength' from assays(dds), correcting for library size

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

We can also plot the estimated dispersions. This will plot both the gene-wise dispersion estimates (black), the fitted mean-dispersion trend (red curve), and the shrunken estimates (blue). You will notice that for some genes with high gene-wise dispersion estimates (by default more than two standard deviations above the fitted value), the blue circle overlaps with the black dot (i.e., the gene-wise estimates are not shrunken towards the trend), while for most genes, the shrunken estimate is not equal to the original gene-wise estimate.

DESeq2::plotDispEsts(dds)

The DESeq function will print out a message for the various steps it performs. These are described in more detail in the manual page, which can be accessed by typing ?DESeq. Briefly these are: the estimation of size factors (controlling for differences in the sequencing depth of the samples), the estimation of dispersion values for each gene, and fitting a generalized linear model.

A DESeqDataSet is returned that contains all the fitted parameters within it, and the following section describes how to extract out results tables of interest from this object.

Calling the results function without any arguments will extract the estimated log2 fold changes and p values for the last variable in the design formula. If there are more than 2 levels for this variable, results will extract the results table for a comparison of the last level over the first level. This comparison is printed at the top of the output: treatment Dexamethasone vs Untreated. Other comparisons can be performed via the contrast argument, if necessary.

res <- DESeq2::results(dds)
head(res)

## log2 fold change (MLE): treatment Dexamethasone vs Untreated 
## Wald test p-value: treatment Dexamethasone vs Untreated 
## DataFrame with 6 rows and 6 columns
##                     baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##                    <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## ENSG00000000003.14  740.1093      -0.365327 0.1073385 -3.403501 6.65282e-04 4.63552e-03
## ENSG00000000419.12  511.6990       0.202232 0.1278579  1.581694 1.13720e-01 2.88125e-01
## ENSG00000000457.13  314.1680       0.033792 0.1552106  0.217717 8.27650e-01 9.21718e-01
## ENSG00000000460.16   79.7988      -0.120633 0.3055270 -0.394836 6.92964e-01 8.48743e-01
## ENSG00000000971.15 5715.3064       0.442982 0.0904089  4.899766 9.59508e-07 1.36387e-05
## ENSG00000001036.13 2002.9872      -0.219458 0.0845409 -2.595886 9.43472e-03 4.33536e-02

As res is a DataFrame object, it carries metadata with information on the meaning of the columns:

mcols(res, use.names = TRUE)

## DataFrame with 6 rows and 2 columns
##                        type            description
##                 <character>            <character>
## baseMean       intermediate mean of normalized c..
## log2FoldChange      results log2 fold change (ML..
## lfcSE               results standard error: trea..
## stat                results Wald statistic: trea..
## pvalue              results Wald test p-value: t..
## padj                results   BH adjusted p-values

The first column, baseMean, is a just the average of the normalized count values, dividing by size factors, taken over all samples in the DESeqDataSet. The remaining four columns refer to a specific contrast, namely the comparison of the Dexamethasone level over the Untreated level for the factor variable treatment.

The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to Dexamethasone treatment in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of 2^1.5.

Of course, this estimate has an uncertainty associated with it, which is available in the column lfcSE, the standard error estimate for the log2 fold change estimate. We can also express the uncertainty of a particular effect size estimate as the result of a statistical test. The purpose of a test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is zero effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability (i.e., the type of variability that you can expect between different samples in the same treatment group). As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue. Remember that a p value indicates the probability that an effect as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis.

We can also summarize the results with the following line of code, which reports some additional information, that will be covered in later sections.

summary(res)

## 
## out of 16637 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2362, 14%
## LFC < 0 (down)     : 2019, 12%
## outliers [1]       : 0, 0%
## low counts [2]     : 646, 3.9%
## (mean count < 12)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

hist(res$pvalue)

## Remove the genes that were filtered out in the independent filtering
hist(res$pvalue[!is.na(res$padj)])

## We also add a couple of extra columns that will be useful for the interactive
## visualization later
rowData(dds)$log10Dispersion <- log10(rowData(dds)$dispersion)

restmp <- DataFrame(res)
restmp$log10BaseMean <- log10(restmp$baseMean)
restmp$mlog10PValue <- -log10(restmp$pvalue)
colnames(restmp) <- paste0("DESeq2_dex_vs_untrt_", colnames(restmp))
rowData(dds) <- cbind(rowData(dds), restmp)

Note that there are many genes with differential expression due to Dexamethasone treatment at the FDR level of 10%. There are two ways to be more strict about which set of genes are considered significant:

lower the false discovery rate threshold (the threshold on padj in the results table)
raise the log2 fold change threshold from 0 using the lfcThreshold argument of results

If we lower the false discovery rate threshold, we should also tell this value to results(), so that the function will use an alternative threshold for the optimal independent filtering step:

res.05 <- results(dds, alpha = 0.05, 
                  contrast = c("treatment", "Dexamethasone", "Untreated"))
table(res.05$padj < 0.05)

## 
## FALSE  TRUE 
## 12712  3602

If we want to raise the log2 fold change threshold, so that we test for genes that show more substantial changes due to treatment, we simply supply a value on the log2 scale. For example, by specifying lfcThreshold = 1, we look for genes that show significant effects of treatment on gene counts more than doubling or less than halving, because 2^1 = 2.

resLFC1 <- results(dds, lfcThreshold = 1, 
                   contrast = c("treatment", "Dexamethasone", "Untreated"))
summary(resLFC1)

## 
## out of 16637 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 1.00 (up)    : 156, 0.94%
## LFC < -1.00 (down) : 84, 0.5%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

table(resLFC1$padj < 0.1)

## 
## FALSE  TRUE 
## 16397   240

Sometimes a subset of the p values in res will be NA (“not available”). This is DESeq’s way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.

With DESeq2, there is also an easy way to plot the (normalized, transformed) counts for specific genes, using the plotCounts function:

plotCounts(dds, gene = "ENSG00000189221.9", intgroup = "treatment", 
           normalized = TRUE, transform = FALSE)

9 Plotting results

9.1 MA plot with DESeq2

An MA-plot (Dudoit et al. 2002) provides a useful overview for an experiment with a two-group comparison. The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis (“M” for minus, because a log ratio is equal to log minus log, and “A” for average). Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default with DESeq2) are shown in color

DESeq2::plotMA(res, ylim = c(-5, 5))

We see that there are many genes with low expression levels that nevertheless have large fold changes (since we are, effectively, dividing by a small number). To get more interpretable log fold changes (e.g., for ranking genes), we use the lfcShrink function to shrink the log2 fold changes for the comparison of Dexamethasone-treated vs untreated samples. There are three types of shrinkage estimators in DESeq2, which are covered in the vignette. Here we specify the apeglm method for shrinking coefficients, which is good for shrinking the noisy LFC estimates while giving low bias LFC estimates for true large differences (Zhu, Ibrahim, and Love 2019). To use apeglm we specify a coefficient from the model to shrink, either by name or number as the coefficient appears in resultsNames(dds).

library(apeglm)
DESeq2::resultsNames(dds)

## [1] "Intercept"                            "cell_line_N061011_vs_N052611"        
## [3] "cell_line_N080611_vs_N052611"         "cell_line_N61311_vs_N052611"         
## [5] "treatment_Dexamethasone_vs_Untreated" "DESeq2_dex_vs_untrt_log2FoldChange"

resape <- DESeq2::lfcShrink(dds, coef = "treatment_Dexamethasone_vs_Untreated", 
                            type = "apeglm")

## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895

DESeq2::plotMA(resape, ylim = c(-5, 5))

9.2 Heatmap of the most significant genes

Another way of representing the results of a differential expression analysis is to construct a heatmap of the top differentially expressed genes. A heatmap is a “color coded expression matrix,” where the rows and columns are clustered using hierarchical clustering. Typically, it should not be applied to counts, but works better with transformed values. Here we show how it can be applied to the variance-stabilized values generated above. We would expect the contrasted sample groups to cluster separately (“by construction,” since the genes were selected to be most discriminative between the groups). The heatmap will allow us to display, e.g., the variability within the groups of the differentially expressed genes. We choose the top 30 differentially expressed genes. There are many functions in R that can generate heatmaps, here we show the one from the pheatmap package.

library(pheatmap)
stopifnot(rownames(vsd) == rownames(res))
mat <- assay(vsd)
rownames(mat) <- ifelse(!is.na(rowData(vsd)$SYMBOL), rowData(vsd)$SYMBOL, rownames(vsd))
mat <- mat[head(order(res$padj), 30), ]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(vsd)[, c("treatment"), drop = FALSE])
pheatmap(mat, annotation_col = df)

9.3 Interactive visualization with iSEE

iSEE is a Bioconductor package that allows interactive exploration of any data stored in a SummarizedExperiment container, or any class extending this (such as, e.g., the DESeqDataSet class, or the SingleCellExperiment for single-cell data). By calling the iSEE() function with the object as the first argument, an interactive application will be opened, in which all observed values as well as metadata columns (rowData and colData) can be explored.

library(iSEE)
library(iSEEu)
dds <- iSEEu::registerAveAbPatterns(dds, "log10BaseMean")
dds <- iSEEu::registerLogFCPatterns(dds, "log2FoldChange")
dds <- iSEEu::registerPValuePatterns(dds, "pvalue")
app <- iSEE(dds, initial = list(MAPlot(), VolcanoPlot(), 
                                RowDataTable(), FeatureAssayPlot()))
## shiny::runApp(app)

9.4 Exporting results to CSV file

You can easily save the results table in a CSV file that you can then share or load with a spreadsheet program such as Excel (note, however, that Excel sometimes does funny things to gene identifiers (Zeeberg et al. 2004; Ziemann, Eren, and El-Osta 2016)). The call to as.data.frame is necessary to convert the DataFrame object to a data.frame object that can be processed by write.csv. Here, we first show how to add gene symbols to the output table, and then export just the top 100 genes for demonstration.

stopifnot(all(rownames(res) == rownames(dds)))
res$symbol <- rowData(dds)$SYMBOL

resOrdered <- res[order(res$padj), ]
head(resOrdered)

## log2 fold change (MLE): treatment Dexamethasone vs Untreated 
## Wald test p-value: treatment Dexamethasone vs Untreated 
## DataFrame with 6 rows and 7 columns
##                     baseMean log2FoldChange     lfcSE      stat       pvalue         padj
##                    <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
## ENSG00000189221.9   2371.265        3.39426  0.135575   25.0361 2.47518e-138 3.95806e-134
## ENSG00000120129.5   3417.255        2.96990  0.120928   24.5593 3.44226e-133 2.75226e-129
## ENSG00000101347.9  14106.720        3.74934  0.156084   24.0212 1.66892e-127 8.89591e-124
## ENSG00000152583.12   973.479        4.50022  0.197111   22.8309 2.26145e-115 9.04071e-112
## ENSG00000196136.17  2708.309        3.24329  0.143803   22.5538 1.23286e-112 3.94293e-109
## ENSG00000211445.11 12502.886        3.76804  0.168068   22.4198 2.52549e-111 6.73085e-108
##                         symbol
##                    <character>
## ENSG00000189221.9         MAOA
## ENSG00000120129.5        DUSP1
## ENSG00000101347.9       SAMHD1
## ENSG00000152583.12     SPARCL1
## ENSG00000196136.17    SERPINA3
## ENSG00000211445.11        GPX3

resOrderedDF <- as.data.frame(resOrdered)[seq_len(100), ]
write.table(cbind(id = rownames(resOrderedDF), resOrderedDF), 
            file = "results.txt", quote = FALSE, sep = "\t",
            row.names = FALSE)

10 Bonus section: differential expression analysis with edgeR

In the workflow above, we used DESeq2 to perform the exploration and differential expression analysis. Here, we illustrate how it can be done with edgeR, and briefly compare the results.

10.1 The DGEList

The edgeR package uses another type of data container than DESeq2, namely a DGEList object. tximeta provides a convenient wrapper function to generate a DGEList from the gene-level SummarizedExperiment object:

library(edgeR)

dge <- tximeta::makeDGEList(sg)
names(dge)

## [1] "counts"  "samples" "genes"   "offset"

head(dge$samples)

##            group lib.size norm.factors      names cell_line     treatment
## SRR1039508     1 21100805            1 SRR1039508    N61311     Untreated
## SRR1039509     1 19298584            1 SRR1039509    N61311 Dexamethasone
## SRR1039512     1 26145537            1 SRR1039512   N052611     Untreated
## SRR1039513     1 15688246            1 SRR1039513   N052611 Dexamethasone
## SRR1039516     1 25268618            1 SRR1039516   N080611     Untreated
## SRR1039517     1 31891456            1 SRR1039517   N080611 Dexamethasone

As for the DESeqDataSet, a DGEList can also be generated directly from a count matrix and sample metadata (see the DGEList() constructor function). Just like the SummarizedExperiment and the DESeqDataSet, the DGEList contains all the information we need: the count matrix, information about the samples (the columns of the count matrix), and information about the genes (the rows of the count matrix). One difference compared to the DESeqDataSet is that the experimental design is not defined when creating the DGEList, but later in the workflow.

Next, we filter out the genes with low expression levels. edgeR provides a function (filterByExpr) to do this in a straightforward way.

design <- model.matrix(~ cell_line + treatment, data = dge$samples)
keep <- edgeR::filterByExpr(dge, design = design)
dge <- dge[keep, ]
dim(dge)

## [1] 17542     8

10.2 Exploratory analysis

We saw above how to perform PCA to reduce the dimensionality and explore our data set. Another way to reduce dimensionality, which is in many ways similar (and under certain circumstances equivalent) to PCA, is multidimensional scaling (MDS). For MDS, we first have to calculate all pairwise distances between our objects (samples in this case), and then create a (typically) two-dimensional representation where these pre-calculated distances are represented as accurately as possible. This means that depending on how the pairwise sample distances are defined, the two-dimensional plot can be very different, and it is important to choose a distance that is suitable for the type of data at hand.

edgeR contains a function plotMDS, which operates on a DGEList object and generates a two-dimensional MDS representation of the samples. The default distance between two samples can be interpreted as the “typical” log fold change between the two samples, for the genes that are most different between them (by default, the top 500 genes, but this can be modified). We generate an MDS plot from the DGEList object dge, coloring by the treatment and using different plot symbols for different donors.

Note: Since the DGEList was created using the makeDGEList function, the average transcript length offsets have been incorporated in the object and will be used as offsets in downstream analysis. If this is not the case, we need to estimate TMM normalization factors before performing further analysis.

# dge <- edgeR::calcNormFactors(dge)
plotMDS(dge, top = 500, labels = NULL,
        col = as.numeric(dge$samples$treatment),
        cex = 0.5)

10.3 Differential expression analysis with edgeR

Next we will show how to perform differential expression analysis with edgeR. Recall that we have a DGEList dge, containing all the necessary information, and a design matrix:

names(dge)

## [1] "counts"  "samples" "genes"   "offset"

design

##            (Intercept) cell_lineN061011 cell_lineN080611 cell_lineN61311 treatmentDexamethasone
## SRR1039508           1                0                0               1                      0
## SRR1039509           1                0                0               1                      1
## SRR1039512           1                0                0               0                      0
## SRR1039513           1                0                0               0                      1
## SRR1039516           1                0                1               0                      0
## SRR1039517           1                0                1               0                      1
## SRR1039520           1                1                0               0                      0
## SRR1039521           1                1                0               0                      1
## attr(,"assign")
## [1] 0 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$cell_line
## [1] "contr.treatment"
## 
## attr(,"contrasts")$treatment
## [1] "contr.treatment"

Next, we estimate the dispersion for each gene and plot the estimated dispersions.

dge <- edgeR::estimateDisp(dge, design)
edgeR::plotBCV(dge)

Finally, we fit the generalized linear model and perform the test. In the glmQLFTest function, we indicate which coefficient (which column in the design matrix) that we would like to test for. It is possible to test more general contrasts as well, and the user guide contains many examples on how to do this. The topTags function extracts the top-ranked genes. You can indicate the adjusted p-value cutoff, and/or the number of genes to keep.

fit <- edgeR::glmQLFit(dge, design)
qlf <- edgeR::glmQLFTest(fit, coef = "treatmentDexamethasone")
tt.all <- edgeR::topTags(qlf, n = nrow(dge), sort.by = "none") # all genes
hist(tt.all$table$PValue)

tt <- edgeR::topTags(qlf, n = nrow(dge), p.value = 0.1) # genes with adj.p<0.1
tt10 <- edgeR::topTags(qlf) # just the top 10 by default
tt10

## Coefficient:  treatmentDexamethasone 
##                               gene_id       tx_ids            isoform  max_prop     iso_prop
## ENSG00000168309.17 ENSG00000168309.17 ENST0000....  ENST00000360997.6 0.7187553 0.718755....
## ENSG00000109906.13 ENSG00000109906.13 ENST0000....  ENST00000335953.8 0.6155125 0.615512....
## ENSG00000250978.5   ENSG00000250978.5 ENST0000....  ENST00000515184.1 0.5894650 0.410534....
## ENSG00000120129.5   ENSG00000120129.5 ENST0000....  ENST00000239223.3 1.0000000            1
## ENSG00000189221.9   ENSG00000189221.9 ENST0000....  ENST00000338702.3 0.8443803 0.002018....
## ENSG00000146250.6   ENSG00000146250.6 ENST0000....  ENST00000369700.3 1.0000000            1
## ENSG00000233117.2   ENSG00000233117.2 ENST0000....  ENST00000608792.1 0.6560386 0.656038....
## ENSG00000157214.13 ENSG00000157214.13 ENST0000....  ENST00000287908.7 0.4012273 0.002543....
## ENSG00000162614.18 ENSG00000162614.18 ENST0000.... ENST00000330010.12 0.6452494 0.023317....
## ENSG00000106484.15 ENSG00000106484.15 ENST0000....  ENST00000223215.9 0.7277559 0.001371....
##                     SYMBOL           GO     logFC   logCPM        F       PValue          FDR
## ENSG00000168309.17 FAM107A GO:00015....  4.732685 2.739486 705.3301 5.236993e-10 9.186733e-06
## ENSG00000109906.13  ZBTB16 GO:00001....  6.338011 4.063221 931.8917 3.800971e-09 3.184294e-05
## ENSG00000250978.5     <NA>           NA  5.744063 1.135156 273.8724 1.073525e-08 3.184294e-05
## ENSG00000120129.5    DUSP1 GO:00017....  2.962313 7.222537 689.7462 1.105692e-08 3.184294e-05
## ENSG00000189221.9     MAOA GO:00055....  3.380304 6.688272 673.8329 1.206313e-08 3.184294e-05
## ENSG00000146250.6   PRSS35 GO:00055.... -2.727241 3.853948 659.3912 1.309105e-08 3.184294e-05
## ENSG00000233117.2     <NA>           NA  4.868884 1.771002 663.2961 1.328077e-08 3.184294e-05
## ENSG00000157214.13  STEAP2 GO:00001....  2.005270 7.055536 625.0409 1.596428e-08 3.184294e-05
## ENSG00000162614.18    NEXN GO:00058....  2.000846 7.936632 618.1196 1.664065e-08 3.184294e-05
## ENSG00000106484.15    MEST GO:00055.... -2.045478 5.414366 572.3609 2.216233e-08 3.184294e-05

The columns in the edgeR result data frame are similar to the ones output by DESeq2. edgeR represents the overall expression level on the log-CPM scale rather than on the normalized count scale that DESeq2 uses. The F column contains the test statistic, and the FDR column contains the Benjamini-Hochberg adjusted p-values.

We can compare the sets of significantly differentially expressed genes to see how the results from the two packages overlap:

shared <- intersect(rownames(res), rownames(tt.all$table))
dmatch <- match(shared, rownames(res))
ematch <- match(shared, rownames(tt.all$table))
table(DESeq2 = res$padj[dmatch] < 0.1,
      edgeR = tt.all$table$FDR[ematch] < 0.1)

##        edgeR
## DESeq2  FALSE  TRUE
##   FALSE 10906   704
##   TRUE    107  4274

We can also compare the two result lists by the ranks:

plot(rank(res$pvalue[dmatch]),
     rank(tt.all$table$PValue[ematch]),
     cex = 0.1, xlab = "DESeq2", ylab = "edgeR")

Also with edgeR we can test for significance relative to a fold-change threshold, using the function glmTreat. Below we set the log fold-change threshold to 1 (i.e., fold change threshold equal to 2), as for DESeq2 above.

treatres <- edgeR::glmTreat(fit, coef = "treatmentDexamethasone", lfc = 1)
tt.treat <- edgeR::topTags(treatres, n = nrow(dge), sort.by = "none")

In edgeR, the MA plot is obtained via the plotSmear function.

edgeR::plotSmear(qlf, de.tags = rownames(tt$table))

## Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a graphical parameter

Session information

It is good practice to always include a list of the software versions that were used to perform a given analysis, for reproducibility and trouble-shooting purposes. One way of achieving this is via the sessionInfo() function.

sessionInfo()

## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] edgeR_4.2.0                 limma_3.60.2                iSEEu_1.16.0               
##  [4] iSEEhex_1.6.0               iSEE_2.16.0                 SingleCellExperiment_1.26.0
##  [7] pheatmap_1.0.12             apeglm_1.26.0               vsn_3.72.0                 
## [10] ExploreModelMatrix_1.16.0   GenomicFeatures_1.56.0      org.Hs.eg.db_3.19.1        
## [13] AnnotationDbi_1.66.0        DESeq2_1.44.0               SummarizedExperiment_1.34.0
## [16] Biobase_2.64.0              MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [19] GenomicRanges_1.56.0        GenomeInfoDb_1.40.1         IRanges_2.38.0             
## [22] S4Vectors_0.42.0            BiocGenerics_0.50.0         tximeta_1.22.1             
## [25] airway2_0.1.0               BiocStyle_2.32.0            testthat_3.2.1.1           
## [28] rmarkdown_2.27              devtools_2.4.5              usethis_2.2.3              
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.0            later_1.3.2              BiocIO_1.14.0           
##   [4] bitops_1.0-7             filelock_1.0.3           tibble_3.2.1            
##   [7] preprocessCore_1.66.0    XML_3.99-0.16.1          lifecycle_1.0.4         
##  [10] httr2_1.0.1              doParallel_1.0.17        lattice_0.22-6          
##  [13] vroom_1.6.5              ensembldb_2.28.0         MASS_7.3-60.2           
##  [16] magrittr_2.0.3           sass_0.4.9               jquerylib_0.1.4         
##  [19] yaml_2.3.8               remotes_2.5.0            httpuv_1.6.15           
##  [22] sessioninfo_1.2.2        pkgbuild_1.4.4           RColorBrewer_1.1-3      
##  [25] cowplot_1.1.3            DBI_1.2.3                abind_1.4-5             
##  [28] pkgload_1.3.4            zlibbioc_1.50.0          purrr_1.0.2             
##  [31] AnnotationFilter_1.28.0  RCurl_1.98-1.14          rappdirs_0.3.3          
##  [34] circlize_0.4.16          GenomeInfoDbData_1.2.12  ggrepel_0.9.5           
##  [37] codetools_0.2-20         DelayedArray_0.30.1      DT_0.33                 
##  [40] xml2_1.3.6               shape_1.4.6.1            tidyselect_1.2.1        
##  [43] UCSC.utils_1.0.0         farver_2.1.2             shinyWidgets_0.8.6      
##  [46] BiocFileCache_2.12.0     GenomicAlignments_1.40.0 jsonlite_1.8.8          
##  [49] GetoptLong_1.0.5         ellipsis_0.3.2           iterators_1.0.14        
##  [52] foreach_1.5.2            bbmle_1.0.25.1           tools_4.4.0             
##  [55] progress_1.2.3           Rcpp_1.0.12              glue_1.7.0              
##  [58] SparseArray_1.4.8        mgcv_1.9-1               xfun_0.44               
##  [61] dplyr_1.1.4              numDeriv_2016.8-1.1      shinydashboard_0.7.2    
##  [64] withr_3.0.0              BiocManager_1.30.23      fastmap_1.2.0           
##  [67] fansi_1.0.6              shinyjs_2.1.0            digest_0.6.35           
##  [70] R6_2.5.1                 mime_0.12                listviewer_4.0.0        
##  [73] colorspace_2.1-0         biomaRt_2.60.0           RSQLite_2.3.7           
##  [76] hexbin_1.28.3            utf8_1.2.4               tidyr_1.3.1             
##  [79] generics_0.1.3           rtracklayer_1.64.0       prettyunits_1.2.0       
##  [82] httr_1.4.7               htmlwidgets_1.6.4        S4Arrays_1.4.1          
##  [85] pkgconfig_2.0.3          gtable_0.3.5             blob_1.2.4              
##  [88] ComplexHeatmap_2.20.0    XVector_0.44.0           brio_1.1.5              
##  [91] htmltools_0.5.8.1        profvis_0.3.8            bookdown_0.39           
##  [94] clue_0.3-65              ProtGenerics_1.36.0      rintrojs_0.3.4          
##  [97] scales_1.3.0             png_0.1-8                knitr_1.47              
## [100] tzdb_0.4.0               rjson_0.2.21             nlme_3.1-165            
## [103] coda_0.19-4.1            curl_5.2.1               shinyAce_0.4.2          
## [106] bdsmatrix_1.3-7          GlobalOptions_0.1.2      cachem_1.1.0            
## [109] stringr_1.5.1            BiocVersion_3.19.1       vipor_0.4.7             
## [112] parallel_4.4.0           miniUI_0.1.1.1           restfulr_0.0.15         
## [115] pillar_1.9.0             grid_4.4.0               vctrs_0.6.5             
## [118] urlchecker_1.0.1         promises_1.3.0           dbplyr_2.5.0            
## [121] cluster_2.1.6            xtable_1.8-4             tximport_1.32.0         
## [124] evaluate_0.23            readr_2.1.5              tinytex_0.51            
## [127] magick_2.8.3             mvtnorm_1.2-5            cli_3.6.2               
## [130] locfit_1.5-9.9           compiler_4.4.0           Rsamtools_2.20.0        
## [133] rlang_1.1.4              crayon_1.5.2             labeling_0.4.3          
## [136] emdbook_1.3.13           plyr_1.8.9               affy_1.82.0             
## [139] fs_1.6.4                 stringi_1.8.4            viridisLite_0.4.2       
## [142] BiocParallel_1.38.0      txdbmaker_1.0.0          munsell_0.5.1           
## [145] Biostrings_2.72.1        lazyeval_0.2.2           colourpicker_1.3.0      
## [148] Matrix_1.7-0             hms_1.1.3                bit64_4.0.5             
## [151] ggplot2_3.5.1            KEGGREST_1.44.0          statmod_1.5.0           
## [154] shiny_1.8.1.1            highr_0.11               AnnotationHub_3.12.0    
## [157] fontawesome_0.5.2        igraph_2.0.3             memoise_2.0.1           
## [160] affyio_1.74.0            bslib_0.7.0              bit_4.0.5

References

Dudoit, Rine, Yee H. Yang, Matthew J. Callow, and Terence P. Speed. 2002. “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.” In Statistica Sinica, 111–39. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.9702.

Hardcastle, Thomas, and Krystyna Kelly. 2010. “baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data.” BMC Bioinformatics 11 (1): 422+. https://doi.org/10.1186/1471-2105-11-422.

Himes, Blanca E., Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M. Whitaker, et al. 2014. “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.” PloS One 9 (6). https://doi.org/10.1371/journal.pone.0099625.

Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2): 115–21. https://doi.org/10.1038/nmeth.3252.

Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2): R29+. https://doi.org/10.1186/gb-2014-15-2-r29.

Leng, N., J. A. Dawson, J. A. Thomson, V. Ruotti, A. I. Rissman, B. M. G. Smits, J. D. Haag, M. N. Gould, R. M. Stewart, and C. Kendziorski. 2013. “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29 (8): 1035–43. https://doi.org/10.1093/bioinformatics/btt087.

Love, Michael I., Simon Anders, Vladislav Kim, and Wolfgang Huber. 2015. “RNA-Seq Workflow: Gene-Level Exploratory Analysis and Differential Expression.” F1000Research, October. https://doi.org/10.12688/f1000research.7035.1.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550+. https://doi.org/10.1186/s13059-014-0550-8.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14: 417–19.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2009. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4. https://doi.org/10.12688/f1000research.7563.1.

Wu, Hao, Chi Wang, and Zhijin Wu. 2013. “A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data.” Biostatistics 14 (2): 232–43. https://doi.org/10.1093/biostatistics/kxs033.

Zeeberg, Barry R, Joseph Riss, David W Kane, Kimberly J Bussey, Edward Uchio, W Marston Linehan, J Carl Barrett, and John N Weinstein. 2004. “Mistaken Identifiers: Gene Name Errors Can Be Introduced Inadvertently When Using Excel in Bioinformatics.” BMC Bioinformatics 5: 80.

Zhu, Anqi, Joseph G Ibrahim, and Michael I Love. 2019. “Heavy-Tailed Prior Distributions for Sequence Count Data: Removing the Noise and Preserving Large Differences.” Bioinformatics 35: 2084–92.

Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biol. 17 (1): 1–3.