Contents

1 Introduction

In this tutorial we walk through a gene-level RNA-seq differential expression analysis using Bioconductor packages. We start from the gene-vs-sample count matrix, and thus assume that the raw reads have already been quality controlled and that the gene expression has been quantified (either using alignment and counting, or by applying an alignment-free quantification tool). We perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, then perform differential gene expression analysis, and visually explore the results.

Bioconductor (Huber et al. 2015) has many packages supporting analysis of high-throughput sequence data, including RNA-seq. The packages that we will use in this tutorial include core packages maintained by the Bioconductor core team for importing and processing raw sequencing data and loading gene annotations. We will also use contributed packages for statistical analysis and visualization of sequencing data. Through scheduled releases every 6 months, the Bioconductor project ensures that all the packages within a release will work together in harmony (hence the “conductor” metaphor). The packages used in this tutorial are loaded with the library function and can be installed by following the Bioconductor package installation instructions. If you use the results from an R package in published research, you can find the proper citation for the software by typing citation("pkgName"), where you would substitute the name of the package for pkgName. Citing methods papers helps to support and reward the individuals who put time into open source software for genomic data analysis.

Many parts of this tutorial are based on a published RNA-seq workflow available via F1000Research (Love et al. 2015) and as a Bioconductor package.

1.1 Experimental data

The data used in this workflow is stored in an R package, airway2, containing quantification data for eight RNA-seq samples. The quantification files summarize an RNA-seq experiment wherein airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. For more description of the experiment see the PubMed entry 24926665 and for raw data see the GEO entry GSE52778.

The airway2 package contains output from Salmon (Patro et al. 2017), as well as a metadata file with sample annotations. More information about how the raw data was processed is available in this script.

We start by setting the path to the folder containing the quantifications (the output folders from Salmon). Since these are provided with an R package, we will point to the extdata subfolder of the installed package. For a typical analysis of your own data, you would point directly to a folder on your storage system (i.e., not using system.file()).

library(airway2)
dir <- system.file("extdata", package = "airway2")
list.files(dir)
## [1] "multiqc_config.yaml" "multiqc_report.html" "quants"              "Snakefile"          
## [5] "SraRunTable.txt"

The quantification directories, one for each sample, are in the quants folder.

list.files(file.path(dir, "quants"))
## [1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516" "SRR1039517" "SRR1039520"
## [8] "SRR1039521"

For more information about the output files generated by Salmon, see the documentation. The sample identifiers used here are the SRR identifiers from the Sequence Read Archive. For this experiment, we downloaded a table that links the SRR identifiers to the sample information about each experiment. From this Run Selector view of the experiment, we clicked the button: RunInfo Table. This downloads a file called SraRunTable.txt.

2 Reading the metadata

Now, we will read the metadata file that was just mentioned. We will only retain the annotations that we need for the rest of the analysis, but as you can see there is much more information recorded for these samples. The main annotations of interest for this tutorial are treatment, which represents the treatment of the sample (Untreated or Dexamethasone) and cell_line, which represents the cell line. The sample identifier is given by the Run column.

coldata <- read.delim(file.path(dir, "SraRunTable.txt"))
colnames(coldata)
##  [1] "AvgSpotLen"         "BioSample"          "Experiment"         "MBases"            
##  [5] "MBytes"             "Run"                "SRA_Sample"         "Sample_Name"       
##  [9] "cell_line"          "ercc_mix"           "treatment"          "Assay_Type"        
## [13] "BioProject"         "Center_Name"        "Consent"            "DATASTORE_filetype"
## [17] "DATASTORE_provider" "InsertSize"         "Instrument"         "LibraryLayout"     
## [21] "LibrarySelection"   "LibrarySource"      "LoadDate"           "Organism"          
## [25] "Platform"           "ReleaseDate"        "SRA_Study"          "cell_type"         
## [29] "source_name"        "tissue"
coldata <- coldata[, c("Run", "cell_line", "treatment")]
coldata
##          Run cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone
## 7 SRR1039520   N061011     Untreated
## 8 SRR1039521   N061011 Dexamethasone

The tximeta package, which we will use to import the quantifications, requires that there is a column called names, representing the sample names. Hence, we first rename the Run column.

colnames(coldata)[colnames(coldata) == "Run"] <- "names"
head(coldata)
##        names cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone

In addition to the names column, tximeta requires that coldata has a column named files, pointing to the Salmon output (the quant.sf file) for the respective samples.

coldata$files <- file.path(dir, "quants", coldata$names, "quant.sf")
head(coldata)
##        names cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone
##                                                                                                             files
## 1 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039508/quant.sf
## 2 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039509/quant.sf
## 3 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039512/quant.sf
## 4 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039513/quant.sf
## 5 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039516/quant.sf
## 6 /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/airway2/extdata/quants/SRR1039517/quant.sf
all(file.exists(coldata$files))
## [1] TRUE

Now we have everything we need, and can import the quantifications into R for further analysis.

3 Importing quantifications into R

We will next read the Salmon quantifications provided in the airway2 package into R and summarize the expected counts on the gene level. A simple way to import results from a variety of transcript abundance estimation tools into R is provided by the tximport and tximeta packages. Here, tximport reads the quantifications into a list of matrices, while tximeta instead aggregates the information into a SummarizedExperiment object, and also automatically adds additional annotations for the features. Both packages can return quantifications on the transcript level or aggregate them on the gene level. In the latter case, they also calculate average transcript lengths for each gene and each sample, which can be used as offsets to account for differential isoform usage across samples in the differential gene expression analysis (Soneson, Love, and Robinson 2015).

For Salmon, the quantifications for a given sample are stored in the quant.sf file, which is included in the output directory.

quants <- read.delim(file.path(dir, "quants", coldata$names[1], "quant.sf"))
head(quants)
##                Name Length EffectiveLength      TPM NumReads
## 1 ENST00000456328.2   1657        1561.450 0.793609   16.907
## 2 ENST00000450305.2    632         477.000 0.000000    0.000
## 3 ENST00000488147.1   1351        1185.158 4.288707   69.349
## 4 ENST00000619216.1     68           3.000 0.000000    0.000
## 5 ENST00000473358.1    712         557.000 0.000000    0.000
## 6 ENST00000469289.1    535         380.000 0.000000    0.000

The code below imports the Salmon quantifications into R using the tximeta package. Note how the transcriptome that was used for the quantification is automatically recognized and used to annotate the resulting data object. In order for this to work, tximeta requires that the output folder structure from Salmon is retained, since it reads information from the associated log files in addition to the quantified abundances themselves.

suppressPackageStartupMessages({
    library(tximeta)
    library(DESeq2)
    library(org.Hs.eg.db)
    library(SummarizedExperiment)
})

## Import quantifications on the transcript level
st <- tximeta(coldata = coldata, type = "salmon", dropInfReps = TRUE)
## importing quantifications
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 
## found matching linked transcriptome:
## [ GENCODE - Homo sapiens - release 29 ]
## loading existing TxDb created: 2022-02-09 23:19:12
## loading existing transcript ranges created: 2022-02-09 23:19:13
## fetching genome info for GENCODE
st
## class: RangedSummarizedExperiment 
## dim: 205870 8 
## metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
## assays(3): counts abundance length
## rownames(205870): ENST00000456328.2 ENST00000450305.2 ... ENST00000387460.2
##   ENST00000387461.2
## rowData names(3): tx_id gene_id tx_name
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell_line treatment

Question: How can you infer that the SummarizedExperiment object contains abundances for individual transcripts, rather than genes?

We see that tximeta has identified the transcriptome used for the quantification as GENCODE - Homo sapiens - release 29. How did this happen? In fact, the output directory from Salmon contains much more information than just the quant.sf file! (as mentioned above, this means that it is not advisable to move files out of the folder, or to share only the quant.sf file, since the context is lost):

list.files(file.path(dir, "quants", coldata$names[1]), recursive = TRUE)
##  [1] "aux_info/ambig_info.tsv"          "aux_info/bootstrap/bootstraps.gz"
##  [3] "aux_info/bootstrap/names.tsv.gz"  "aux_info/exp_gc.gz"              
##  [5] "aux_info/expected_bias.gz"        "aux_info/fld.gz"                 
##  [7] "aux_info/meta_info.json"          "aux_info/obs_gc.gz"              
##  [9] "aux_info/observed_bias_3p.gz"     "aux_info/observed_bias.gz"       
## [11] "cmd_info.json"                    "lib_format_counts.json"          
## [13] "libParams/flenDist.txt"           "logs/salmon_quant.log"           
## [15] "quant.sf"

In particular, the meta_info.json file contains a hash checksum, which is derived from the set of transcripts used as reference during the quantification and which lets tximeta identify the reference source (by comparing to a table of these hash checksums for commonly used references).

rjson::fromJSON(file = file.path(dir, "quants", coldata$names[1], 
                                 "aux_info", "meta_info.json"))
## $salmon_version
## [1] "0.14.1"
## 
## $samp_type
## [1] "gibbs"
## 
## $opt_type
## [1] "vb"
## 
## $quant_errors
## list()
## 
## $num_libraries
## [1] 1
## 
## $library_types
## [1] "IU"
## 
## $frag_dist_length
## [1] 1001
## 
## $seq_bias_correct
## [1] FALSE
## 
## $gc_bias_correct
## [1] TRUE
## 
## $num_bias_bins
## [1] 4096
## 
## $mapping_type
## [1] "mapping"
## 
## $num_valid_targets
## [1] 205870
## 
## $num_decoy_targets
## [1] 0
## 
## $num_eq_classes
## [1] 516269
## 
## $serialized_eq_classes
## [1] FALSE
## 
## $eq_class_properties
## [1] "range_factorized"
## 
## $length_classes
## [1]    520    669   1065   2328 205012
## 
## $index_seq_hash
## [1] "40849ed828ea7d6a94af54a5c40e5d87eb0ce0fc1e9513208a5cffe59d442292"
## 
## $index_name_hash
## [1] "77aca5545a0626421efb4730dd7b95482c77da261f9bdef70d36e25ee68bb7ef"
## 
## $index_seq_hash512
## [1] "f37ae2a7412efd8518d68c22fc3fcc2478b59833809382abd5d9055475505e516730c70914343af34c9232af92fe22832e70afde6b38c26381097068e1e82551"
## 
## $index_name_hash512
## [1] "67f286e5ec2895c10aa6e3a8feed04cb3a80747de7b3afb620298078481b303d2f979407fc104d4f3e8af0e82be2acb2f294ee5f2f68c00087f2855120d9f4ed"
## 
## $num_bootstraps
## [1] 20
## 
## $num_processed
## [1] 22935521
## 
## $num_mapped
## [1] 21100805
## 
## $num_decoy_fragments
## [1] 0
## 
## $num_dovetail_fragments
## [1] 108510
## 
## $num_fragments_filtered_vm
## [1] 285111
## 
## $num_alignments_below_threshold_for_mapped_fragments_vm
## [1] 874275
## 
## $percent_mapped
## [1] 92.00055
## 
## $call
## [1] "quant"
## 
## $start_time
## [1] "Tue Jul 30 17:57:25 2019"
## 
## $end_time
## [1] "Tue Jul 30 18:07:04 2019"

Question: Can you find a list of annotations recognized/supported by tximeta? Hint: Check the tximeta vignette. What can you do if you are working with e.g. a custom annotation?

The assays in the SummarizedExperiment object are created by directly importing the values from the quant.sf files and combining this information across the eight samples:

We can access any of the assays via the assay function:

head(assay(st, "counts"), 3)
##                   SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520
## ENST00000456328.2     16.907     16.673     10.237       0.00     18.719      0.000      0.000
## ENST00000450305.2      0.000      0.000      0.000       0.00      0.000      0.000      0.000
## ENST00000488147.1     69.349     77.948    199.467      45.81     97.035     95.227     56.568
##                   SRR1039521
## ENST00000456328.2      0.000
## ENST00000450305.2      0.000
## ENST00000488147.1     66.213

Question: Convince yourself that the import worked as described above by comparing some values to those present in the quant.sf text files output from Salmon.

Question: What do the column sums of the counts assay correspond to? What about the column sums of the abundance assay?

You may have noted that st is in fact a RangedSummarizedExperiment object (rather than “just” a SummarizedExperiment object). What does this mean? Let’s look at the information we have about the rows (transcripts) in the object:

rowRanges(st)
## GRanges object with 205870 ranges and 3 metadata columns:
##                     seqnames      ranges strand |     tx_id           gene_id           tx_name
##                        <Rle>   <IRanges>  <Rle> | <integer>   <CharacterList>       <character>
##   ENST00000456328.2     chr1 11869-14409      + |         1 ENSG00000223972.5 ENST00000456328.2
##   ENST00000450305.2     chr1 12010-13670      + |         2 ENSG00000223972.5 ENST00000450305.2
##   ENST00000488147.1     chr1 14404-29570      - |      9483 ENSG00000227232.5 ENST00000488147.1
##   ENST00000619216.1     chr1 17369-17436      - |      9484 ENSG00000278267.1 ENST00000619216.1
##   ENST00000473358.1     chr1 29554-31097      + |         3 ENSG00000243485.5 ENST00000473358.1
##                 ...      ...         ...    ... .       ...               ...               ...
##   ENST00000361681.2     chrM 14149-14673      - |    206692 ENSG00000198695.2 ENST00000361681.2
##   ENST00000387459.1     chrM 14674-14742      - |    206693 ENSG00000210194.1 ENST00000387459.1
##   ENST00000361789.2     chrM 14747-15887      + |    206684 ENSG00000198727.2 ENST00000361789.2
##   ENST00000387460.2     chrM 15888-15953      + |    206685 ENSG00000210195.2 ENST00000387460.2
##   ENST00000387461.2     chrM 15956-16023      - |    206694 ENSG00000210196.2 ENST00000387461.2
##   -------
##   seqinfo: 25 sequences (1 circular) from hg38 genome

By knowing the source and version of the reference used for the quantification, tximeta was able to retrieve the annotation files and decorate the object with information about the transcripts, such as the chromosome and position, and the corresponding gene ID. Importantly, Salmon did not use (or know about) any of this during the quantification! It needs only the transcript sequences. If we just want the annotation columns, without the ranges, we can get those with the rowData accessor:

rowData(st)
## DataFrame with 205870 rows and 3 columns
##                       tx_id           gene_id           tx_name
##                   <integer>   <CharacterList>       <character>
## ENST00000456328.2         1 ENSG00000223972.5 ENST00000456328.2
## ENST00000450305.2         2 ENSG00000223972.5 ENST00000450305.2
## ENST00000488147.1      9483 ENSG00000227232.5 ENST00000488147.1
## ENST00000619216.1      9484 ENSG00000278267.1 ENST00000619216.1
## ENST00000473358.1         3 ENSG00000243485.5 ENST00000473358.1
## ...                     ...               ...               ...
## ENST00000361681.2    206692 ENSG00000198695.2 ENST00000361681.2
## ENST00000387459.1    206693 ENSG00000210194.1 ENST00000387459.1
## ENST00000361789.2    206684 ENSG00000198727.2 ENST00000361789.2
## ENST00000387460.2    206685 ENSG00000210195.2 ENST00000387460.2
## ENST00000387461.2    206694 ENSG00000210196.2 ENST00000387461.2

Similar to the row annotations in rowData, the SummarizedExperiment object contains sample annotations in the colData slot.

colData(st)
## DataFrame with 8 rows and 3 columns
##                  names   cell_line     treatment
##            <character> <character>   <character>
## SRR1039508  SRR1039508      N61311     Untreated
## SRR1039509  SRR1039509      N61311 Dexamethasone
## SRR1039512  SRR1039512     N052611     Untreated
## SRR1039513  SRR1039513     N052611 Dexamethasone
## SRR1039516  SRR1039516     N080611     Untreated
## SRR1039517  SRR1039517     N080611 Dexamethasone
## SRR1039520  SRR1039520     N061011     Untreated
## SRR1039521  SRR1039521     N061011 Dexamethasone

4 Summarizing on the gene level

As we saw, the features in the SummarizedExperiment object above are individual transcripts, rather than genes. Often, however, we want to perform the analysis on the gene level, since the gene-level abundances are more robust and sometimes more interpretable than transcript-level abundances. The rowData contains the information about the corresponding gene for each transcript, in the gene_id column, and tximeta provides a function to summarize on the gene level:

## Summarize quantifications on the gene level
sg <- tximeta::summarizeToGene(st, assignRanges="abundant")
## loading existing TxDb created: 2022-02-09 23:19:12
## obtaining transcript-to-gene mapping from database
## loading existing gene ranges created: 2022-02-09 23:20:02
## gene ranges assigned by isoform abundance, see `assignRanges`
## summarizing abundance
## summarizing counts
## summarizing length
sg
## class: RangedSummarizedExperiment 
## dim: 58294 8 
## metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ... ENSG00000285993.1
##   ENSG00000285994.1
## rowData names(5): gene_id tx_ids isoform max_prop iso_prop
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell_line treatment

Now we have a new RangedSummarizedExperiment object, with one row per gene. The row ranges have been summarized as well, and can be used for subsetting and interpretation just as for the transcripts.

At this point, the only information we have about the genes in our data set, apart from their genomic location and the associated transcript IDs, is the Ensembl ID. Often we need additional annotations, such as gene symbols. Bioconductor provides a range of annotation packages:

For our purposes here, the appropriate OrgDb package is the most suitable, since it contains gene-centric ID conversion tables. Since this is human data, we will use the org.Hs.eg.db package.

## Add gene symbols
## Works because we know the provenance of the identifiers
metadata(sg)$txomeInfo$source # GENCODE
## [1] "GENCODE"
sg <- tximeta::addIds(sg, "SYMBOL", gene = TRUE)
## mapping to new IDs using org.Hs.eg.db
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs
## 'select()' returned 1:many mapping between keys and columns
sg
## class: RangedSummarizedExperiment 
## dim: 58294 8 
## metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
## assays(3): counts abundance length
## rownames(58294): ENSG00000000003.14 ENSG00000000005.5 ... ENSG00000285993.1
##   ENSG00000285994.1
## rowData names(6): gene_id tx_ids ... iso_prop SYMBOL
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell_line treatment
head(rowData(sg))
## DataFrame with 6 rows and 6 columns
##                               gene_id                                                     tx_ids
##                           <character>                                            <CharacterList>
## ENSG00000000003.14 ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
## ENSG00000000005.5   ENSG00000000005.5                        ENST00000373031.4,ENST00000485971.1
## ENSG00000000419.12 ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
## ENSG00000000457.13 ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
## ENSG00000000460.16 ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
## ENSG00000000938.12 ENSG00000000938.12  ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
##                               isoform  max_prop                             iso_prop      SYMBOL
##                           <character> <numeric>                        <NumericList> <character>
## ENSG00000000003.14  ENST00000373020.8  0.914071    0.0474288,0.9140710,0.0000000,...      TSPAN6
## ENSG00000000005.5                  NA       NaN                              NaN,NaN        TNMD
## ENSG00000000419.12  ENST00000371588.9  0.824899    0.8248988,0.0224922,0.0933611,...        DPM1
## ENSG00000000457.13 ENST00000367771.10  0.305852       0.305852,0.271815,0.237448,...       SCYL3
## ENSG00000000460.16  ENST00000413811.3  0.300272 0.08591688,0.00941284,0.15431423,...       FIRRM
## ENSG00000000938.12  ENST00000374003.7  0.500000                            0,0,0,...         FGR

To see a list of the possible columns, use the columns function from the AnnotationDbi package:

AnnotationDbi::columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
##  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GENETYPE"     "GO"          
## [13] "GOALL"        "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
## [19] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"      
## [25] "UCSCKG"       "UNIPROT"

We can even add annotations where we expect (and would like to retain) multiple mapping values, e.g., associated GO terms:

sg <- addIds(sg, "GO", multiVals = "list", gene = TRUE)
## mapping to new IDs using org.Hs.eg.db
## if all matching IDs are desired, and '1:many mappings' are reported,
## set multiVals='list' to obtain all the matching IDs
## 'select()' returned 1:many mapping between keys and columns
head(rowData(sg))
## DataFrame with 6 rows and 7 columns
##                               gene_id                                                     tx_ids
##                           <character>                                            <CharacterList>
## ENSG00000000003.14 ENSG00000000003.14  ENST00000612152.4,ENST00000373020.8,ENST00000614008.4,...
## ENSG00000000005.5   ENSG00000000005.5                        ENST00000373031.4,ENST00000485971.1
## ENSG00000000419.12 ENSG00000000419.12  ENST00000371588.9,ENST00000466152.5,ENST00000371582.8,...
## ENSG00000000457.13 ENSG00000000457.13 ENST00000367771.10,ENST00000367770.5,ENST00000367772.8,...
## ENSG00000000460.16 ENSG00000000460.16  ENST00000498289.5,ENST00000472795.5,ENST00000496973.5,...
## ENSG00000000938.12 ENSG00000000938.12  ENST00000374005.7,ENST00000399173.5,ENST00000374004.5,...
##                               isoform  max_prop                             iso_prop      SYMBOL
##                           <character> <numeric>                        <NumericList> <character>
## ENSG00000000003.14  ENST00000373020.8  0.914071    0.0474288,0.9140710,0.0000000,...      TSPAN6
## ENSG00000000005.5                  NA       NaN                              NaN,NaN        TNMD
## ENSG00000000419.12  ENST00000371588.9  0.824899    0.8248988,0.0224922,0.0933611,...        DPM1
## ENSG00000000457.13 ENST00000367771.10  0.305852       0.305852,0.271815,0.237448,...       SCYL3
## ENSG00000000460.16  ENST00000413811.3  0.300272 0.08591688,0.00941284,0.15431423,...       FIRRM
## ENSG00000000938.12  ENST00000374003.7  0.500000                            0,0,0,...         FGR
##                                                      GO
##                                                  <list>
## ENSG00000000003.14 GO:0005515,GO:0016020,GO:0039532,...
## ENSG00000000005.5  GO:0001886,GO:0001937,GO:0005515,...
## ENSG00000000419.12 GO:0004169,GO:0004582,GO:0004582,...
## ENSG00000000457.13 GO:0000139,GO:0005515,GO:0005524,...
## ENSG00000000460.16 GO:0000278,GO:0000775,GO:0000776,...
## ENSG00000000938.12 GO:0001784,GO:0001784,GO:0001819,...

Note that Salmon returns estimated or expected counts, which are not necessarily integers. They may need to be rounded before they are passed to count-based statistical methods (however, DESeq2 will automatically round the counts, and some other methods, like edgeR, will work also with the non-integer values).

5 Representing counts for differential expression packages

At this point, we have a gene-level count matrix, contained in our SummarizedExperiment object. This is a branching point where we could use a variety of Bioconductor packages for exploration and differential expression of the count matrix, including edgeR (Robinson, McCarthy, and Smyth 2009), DESeq2 (Love, Huber, and Anders 2014), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and BaySeq (Hardcastle and Kelly 2010). We will continue using mainly DESeq2.

Bioconductor software packages often define and use a custom class for storing data that makes sure that all the needed data slots are consistently provided and fulfill any requirements. In addition, Bioconductor has general data classes (such as the SummarizedExperiment) that can be used to move data between packages. The DEFormats package can be useful for converting between different classes. The core Bioconductor classes also provide useful functionality: for example, subsetting or reordering the rows or columns of a SummarizedExperiment automatically subsets or reorders the associated rowRanges and colData, which can help to prevent accidental sample swaps that would otherwise lead to spurious results. With SummarizedExperiment this is all taken care of behind the scenes.

Each of the packages typically used for differential expression has a specific class of object used to store the summarization of the RNA-seq experiment and the intermediate quantities that are calculated during the statistical analysis of the data. DESeq2 uses a DESeqDataSet and edgeR uses a DGEList.

5.1 The DESeqDataSet, sample information, and the design formula

In DESeq2, the custom class is called DESeqDataSet. It is built on top of the SummarizedExperiment class, and it is easy to convert SummarizedExperiment objects into DESeqDataSet objects. One of the two main differences compared to a SummarizedExperiment object is that the assay slot named “counts” can be accessed using the counts accessor function, and the DESeqDataSet class enforces that the values in this matrix are non-negative integers.

A second difference is that the DESeqDataSet has an associated design formula. The experimental design is specified at the beginning of the analysis, as it will inform many of the DESeq2 functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on the design formula). The design formula tells which columns in the sample information table (colData) specify the experimental design and how these factors should be used in the analysis.

Let’s remind ourselves of the design of our experiment:

colData(sg)
## DataFrame with 8 rows and 3 columns
##                  names   cell_line     treatment
##            <character> <character>   <character>
## SRR1039508  SRR1039508      N61311     Untreated
## SRR1039509  SRR1039509      N61311 Dexamethasone
## SRR1039512  SRR1039512     N052611     Untreated
## SRR1039513  SRR1039513     N052611 Dexamethasone
## SRR1039516  SRR1039516     N080611     Untreated
## SRR1039517  SRR1039517     N080611 Dexamethasone
## SRR1039520  SRR1039520     N061011     Untreated
## SRR1039521  SRR1039521     N061011 Dexamethasone

We have samples from two different conditions, and four cell lines:

table(colData(sg)$treatment)
## 
## Dexamethasone     Untreated 
##             4             4
table(colData(sg)$cell_line)
## 
## N052611 N061011 N080611  N61311 
##       2       2       2       2

We want to find the changes in gene expression that can be associated with the different treatments, but we also want to control for differences between the cell lines. The design which accomplishes this is obtained by writing ~ cell_line + treatment. By including cell_line, terms will be added to the model which account for differences between cell lines, and by adding treatment we get a term representing the treatment effect.

Note: it will be helpful for us if the first level of a factor is the reference level (e.g. control, or untreated samples). The reason is that by specifying this, functions further in the pipeline can be used and will give comparisons such as ‘treatment vs control,’ without needing to specify additional arguments.

We can relevel the treatment factor like so:

colData(sg)$treatment <- factor(colData(sg)$treatment)
colData(sg)$treatment <- relevel(colData(sg)$treatment, ref = "Untreated")
colData(sg)$treatment
## [1] Untreated     Dexamethasone Untreated     Dexamethasone Untreated     Dexamethasone
## [7] Untreated     Dexamethasone
## Levels: Untreated Dexamethasone

Also converting cell_line into a factor:

colData(sg)$cell_line <- factor(colData(sg)$cell_line)

We can use R’s formula notation to express any fixed-effects experimental design for edgeR or DESeq2. Note that these packages use the same formula notation as, for instance, the lm function of base R.

Using the ExploreModelMatrix R/Bioconductor package, we can represent our design in a graphical way. The first plot below visualizes the fitted values from the model for each combination of predictor variable levels, in terms of the estimated model coefficients. The second plot shows the number of observations per combination of levels in the predictor variables.

library(ExploreModelMatrix)
vd <- VisualizeDesign(sampleData = colData(sg), 
                      designFormula = ~ cell_line + treatment)
vd$plotlist
## [[1]]

vd$cooccurrenceplots
## [[1]]

We can also open the interactive interface to explore our design further:

ExploreModelMatrix(sampleData = colData(sg), 
                   designFormula = ~ cell_line + treatment)

To generate a DESeqDataSet object from a SummarizedExperiment object, we on