A plyranges and tximeta case study: Overlapping differential expression and differential chromatin accessibility

Instructor(s) name(s) and contact information

Michael Love, Stuart Lee, Michael Lawrence

Introduction

In this case study, we will examine a subset of the RNA-seq and ATAC-seq data from Alasoo et al. (2018) - doi: 10.1038/s41588-018-0046-7. The experiment involved treatment of macrophage cell lines from a number of human donors with IFN gamma, Salmonella infection, or both treatments combined. In the original study, the authors examined gene expression and chromatin accessibility in a subset of 86 successfully differentiated iPSC lines, and examined baseline QTL and response QTL for both expression and accessibility. Here we will not examine the genotypes, but only the expression data and chromatin accessibility data, which is publicly available. First, let us review the main finding as presented in the abstract:

We observed that ~60% of stimulus-specific expression QTLs with a detectable effect on chromatin altered the chromatin accessibility in naive cells, thus suggesting that they perturb enhancer priming. Such variants probably influence binding of cell-type-specific transcription factors, such as PU.1, which can then indirectly alter the binding of stimulus-specific transcription factors, such as NF-κB or STAT2. Thus, although chromatin accessibility assays are powerful for fine-mapping causal regulatory variants, detecting their downstream effects on gene expression will be challenging, requiring profiling of large numbers of stimulated cellular states and time points.

Scientific aim of the case study: here we will perform a much simpler analysis. We will examine the effect of IFNg stimulation on gene expression and chromatin accessibility, and look to see if there is an enrichment of differentially accessible (DA) ATAC-seq peaks in the vicinity of differentially expressed (DE) genes. This is plausible, as the transcriptomic response to IFNg stimulation may be mediated through binding of regulatory proteins to accessible regions, and this binding may increase the accessibility of those regions such that it can be detected by ATAC-seq.

Technical aim of the case study: we will show how to use the Bioconductor package plyranges and tximeta. The first package can be used to perform easily-readable manipulations of data tied to genomic ranges, e.g. shifts, windows, overlaps, etc. The plyranges package is described by Lee, Cook, and Lawrence (2019), and leverages underlying range operations described by Lawrence (2013). The second package is used to read RNA-seq quantification into R/Bioconductor, such that the genomic ranges are automatically attached to the quantification data and differential expression results.

We begin by loading the RNA-seq data contained in the Bioconductor Experiment Data package, macrophage. The package contains RNA-seq quantification from 24 RNA-seq samples, a subset of the RNA-seq samples generated and analyzed by Alasoo et al. (2018). The paired-end reads were quantified using Salmon (Patro et al. 2017), using the Gencode 29 human reference transcripts (Frankish, GENCODE-consoritum, and Flicek 2018). For more details on quantification, and the exact code used, consult the vignette of the macrophage package. The package also contains the Snakemake file that was used to distribute the Salmon quantification jobs on a cluster (Köster and Rahmann 2012).

RNA-seq data import with tximeta

First, we specify a directory dir, where the quantification files are stored. You typically would not use system.file to specify this directory. You could simply specify this directory with:

dir <- "/path/to/quant/files"

However, here we have distributed the files in an R package, so we locate that R package and the files using system.file.

The rest of the chunk reads in the coldata.csv file, selects relevant columns from this table, and created files and condition columns which point to the quantification files and specify the condition (as a factor). One last detail is that the quantification files have been gzipped, so they have the .gz ending. Typically, these would look like quant.sf without the .gz ending.

suppressPackageStartupMessages(library(plyranges))
library(macrophage)
dir <- system.file("extdata", package="macrophage")
coldata <- file.path(dir, "coldata.csv") %>% 
  read.csv() %>% 
  dplyr::select(
    names, 
    id = sample_id, 
    line = line_id, 
    condition = condition_name
  ) %>%
  mutate(
    files = file.path(dir, "quants", names, "quant.sf.gz"),
    condition = relevel(condition, "naive")
  )
head(coldata)
##            names     id   line   condition
## 1 SAMEA103885102 diku_A diku_1       naive
## 2 SAMEA103885347 diku_B diku_1        IFNg
## 3 SAMEA103885043 diku_C diku_1      SL1344
## 4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
## 5 SAMEA103885182 eiwy_A eiwy_1       naive
## 6 SAMEA103885136 eiwy_B eiwy_1        IFNg
##                                                                                files
## 1 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
## 2 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
## 3 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
## 4 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
## 5 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
## 6 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz

We next load the SummarizedExperiment and tximeta packages.

suppressPackageStartupMessages(library(SummarizedExperiment))
library(tximeta)

The following lines of code do a lot of work for the user: importing the RNA-seq quantification (dropping inferential replicates in this case), locating the relevant reference transcriptome, attaching the transcript ranges to the data, and fetching genome information. We can see what data has been imported using assayNames, which lists the assays of the SummarizedExperiment.

se <- tximeta(coldata, dropInfReps=TRUE)
## importing quantifications
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
## found matching transcriptome:
## [ Gencode - Homo sapiens - release 29 ]
## building TxDb with 'GenomicFeatures' package
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .get_cds_IDX(type, phase): The "phase" metadata column contains non-NA values for features of
##   type stop_codon. This information was ignored.
## OK
## generating transcript ranges
## fetching genome info
assayNames(se)
## [1] "counts"    "abundance" "length"

Because tximeta knows the correct reference transcriptome, we can ask tximeta to summarize the transcript-level data to the gene level using the methods of Soneson, Love, and Robinson (2015).

gse <- summarizeToGene(se)
## loading existing TxDb created: 2019-06-22 01:50:06
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
## 
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:plyranges':
## 
##     select
## obtaining transcript-to-gene mapping from TxDb
## generating gene ranges
## summarizing abundance
## summarizing counts
## summarizing length

Basic RNA-seq DE analysis

We can easily run a differential expression analysis with DESeq2 using the following code chunks (Love, Huber, and Anders 2014). The design indicates that we want to control for the donor (line) and test on the condition.

library(DESeq2)
dds <- DESeqDataSet(gse, ~line + condition)
## using counts and average transcript lengths from tximeta
keep <- rowSums(counts(dds) >= 10) >= 6
dds <- dds[keep,]
dds <- DESeq(dds)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds, contrast=c("condition","IFNg","naive"),
               lfcThreshold=1, alpha=0.01)

Always good to do a little visualization of results:

summary(res)
## [1] "DESeqResults object of length 6 with 2 metadata columns"
DESeq2::plotMA(res, ylim=c(-10,10))

plotCounts(dds, which.min(res$pvalue), "condition")

We now output GRanges results (both DE and some non-DE for comparison later). The select command pulls out particular columns from the GRanges, in some cases renaming them so they will be unique when we look at differential accessibility (which also will have LFC and adjusted p-values).

de_genes <- results(dds, 
                    contrast=c("condition","IFNg","naive"),
                    lfcThreshold=1, 
                    format="GRanges") %>%
  filter(padj < 0.01) %>%
  mutate(gene_id=names(.)) %>%
  plyranges::select(gene_id,
                    de_log2FC = log2FoldChange,
                    de_padj = padj)

We re-run results because we don’t want to use an lfcThreshold this time, to pull out genes which are not differentially expressed according to the DESeq2 significance test.

other_genes <- results(dds, 
                       contrast=c("condition","IFNg","naive"),
                       format="GRanges") %>% 
  filter(pvalue > 0.1) %>%
  mutate(gene_id=names(.)) %>%
  plyranges::select(gene_id,
                    de_log2FC = log2FoldChange,
                    de_padj = padj)

ATAC-seq peak DA analysis

The ATAC-seq data can be downloaded from the following deposition URL:

https://zenodo.org/record/1188300#.XIAhXlNKjOQ

The data is fairly large. For running this section of workflow we need the files:

  • ATAC_cqn_matrix.txt.gz (109M)
  • ATAC_sample_metadata.txt.gz (<1M)
  • ATAC_peak_metadata.txt.gz (5.6M)

Due to the large size of the matrix file, we cannot bundle the data with this workflow repository. We therefore set the following code chunks to be unevaluated, until the creation of the peaks object.

The ATAC-seq data has already been normalized with cqn (Hansen, Irizarry, and Z 2012) and log2 transformed. Loading the cqn-normalized matrix of log2 transformed read counts takes ~30 seconds and loads an object of ~370 Mb.

atac_mat <- as.matrix(read.delim("ATAC_cqn_matrix.txt.gz"))

We also read in the sample metadata:

atac_coldata <- read.delim("ATAC_sample_metadata.txt.gz") %>% 
  plyranges::select(
    sample_id,
    donor,
    condition = condition_name
  ) %>% 
  mutate(condition = relevel(condition, "naive"))

Finally, we read in the peak metadata (locations in the genome), and convert to a GRanges object:

peaks_df <- read.delim("ATAC_peak_metadata.txt.gz", strings=FALSE)
peaks_gr <- peaks_df %>% 
  plyranges::select(seqnames = chr, start, end, peak_id=gene_id) %>% 
  as_granges()
# we know this from the Zenodo entry
# https://zenodo.org/record/1188300#.XJOFSlNKiL5
genome(peaks_gr) <- "GRCh38"

Now we make sure that our ATAC-seq quantifications and our metadata are aligned:

idx <- match(colnames(atac_mat), atac_coldata$sample_id)
atac_coldata <- atac_coldata[idx,]
all.equal(colnames(atac_mat), as.character(atac_coldata$sample_id))

We then combine the data, and two pieces of metadata into a SummarizedExperiment:

atac <- SummarizedExperiment(list(cqndata=atac_mat),
                             rowRanges=peaks_gr,
                             colData=atac_coldata)

We can check the SD over mean plot, to assess for any systematic trends:

rmu <- rowMeans(assay(atac))
rvar <- rowVars(assay(atac))
idx <- sample(nrow(atac),1e4)
plot(rmu[idx], sqrt(rvar[idx]), cex=.1)

For assessing differential accessibility, we run limma (Smyth 2004):

library(limma)
design <- model.matrix(~donor + condition, colData(atac))
fit <- lmFit(assay(atac), design)
fit <- eBayes(fit)
idx <- which(colnames(fit$coefficients) == "conditionIFNg")
tt <- topTable(fit, coef=idx, sort.by="none", n=nrow(atac))

We can plot the top gene by LFC:

idx <- which.max(tt$logFC)
plot(assay(atac)[idx,] ~ atac$condition)
table(tt$logFC > 1 & tt$adj.P.Val < .05)

We now take the rowRanges of the SummarizedExperiment and attach the LFC and adjusted p-value from limma, so that we can consider the overlap with differential expression.

peaks <- rowRanges(atac) %>% 
  mutate(
    da_log2FC = tt$logFC,
    da_padj = tt$adj.P.Val
  )
seqlevelsStyle(peaks) <- "UCSC"
genome(peaks) <- "hg38"

Due to restrictions on the size of this workflow repository, we do not evaluate the above chunks, but instead now load the peaks object that was saved after following the code above.

library(plyrangesTximetaCaseStudy)
## Warning: replacing previous import 'DESeq2::plotMA' by 'limma::plotMA' when
## loading 'plyrangesTximetaCaseStudy'
data(atac_peaks)
peaks
## GRanges object with 296220 ranges and 3 metadata columns:
##                    seqnames              ranges strand |          peak_id
##                       <Rle>           <IRanges>  <Rle> |      <character>
##        ATAC_peak_1     chr1          9979-10668      * |      ATAC_peak_1
##        ATAC_peak_2     chr1         10939-11473      * |      ATAC_peak_2
##        ATAC_peak_3     chr1         15505-15729      * |      ATAC_peak_3
##        ATAC_peak_4     chr1         21148-21481      * |      ATAC_peak_4
##        ATAC_peak_5     chr1         21864-22067      * |      ATAC_peak_5
##                ...      ...                 ...    ... .              ...
##   ATAC_peak_296216     chrX 155896572-155896835      * | ATAC_peak_296216
##   ATAC_peak_296217     chrX 155958507-155958646      * | ATAC_peak_296217
##   ATAC_peak_296218     chrX 156016760-156016975      * | ATAC_peak_296218
##   ATAC_peak_296219     chrX 156028551-156029422      * | ATAC_peak_296219
##   ATAC_peak_296220     chrX 156030135-156030785      * | ATAC_peak_296220
##                             da_log2FC              da_padj
##                             <numeric>            <numeric>
##        ATAC_peak_1  0.266185396736073 9.10672732956434e-05
##        ATAC_peak_2   0.32217712436691 2.03434717570469e-05
##        ATAC_peak_3 -0.574159538548115 3.41707743345703e-08
##        ATAC_peak_4  -1.14706617895329 8.22298606986521e-26
##        ATAC_peak_5 -0.896143162633654 4.79452571676397e-11
##                ...                ...                  ...
##   ATAC_peak_296216 -0.834628897017445  1.3354605397165e-11
##   ATAC_peak_296217 -0.147537281935847    0.313014754316915
##   ATAC_peak_296218 -0.609732301631964 3.62338775135558e-09
##   ATAC_peak_296219 -0.347678474957794 6.94823191242968e-06
##   ATAC_peak_296220  0.492442459200901 7.07663984067763e-13
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths

We filter to the set of peaks with a nominal FDR bound of 1%.

da_peaks <- peaks %>% filter(da_padj < .01)

Overlap analysis with plyranges

Now that we have DE genes, genes without strong signal of DE, and DA peaks, we can perform our original aim to assess the enrichment of DA peaks near DE genes. We’ve already used plyranges a number of times above, to filter, mutate and select on GRanges objects, but now we will get into more complicated use cases.

We define a function to sample n ranges from a GRanges object x.

# sub-sample, here define a dplyr function first
sample_n <- function(x,n) x[sample.int(length(x),n,replace=FALSE)]

We downsample the number of “other” genes to the number of DE genes:

other_genes <- other_genes %>%
  sample_n(length(de_genes))

We then bind these two sets of ranges together and add a new column origin. From ?bind_ranges:

When .id is supplied a new column is created that links each row to the original Range object.

lvls <- c("not_de","de")
all_genes <- bind_ranges(de=de_genes,
                         not_de=other_genes,
                         .id="origin") %>%
  mutate(
    origin = factor(origin, lvls)
  )
all_genes
## GRanges object with 1498 ranges and 4 metadata columns:
##                      seqnames              ranges strand |
##                         <Rle>           <IRanges>  <Rle> |
##   ENSG00000000971.15     chr1 196651878-196747504      + |
##    ENSG00000001561.6     chr6   46129993-46146699      + |
##   ENSG00000002549.12     chr4   17577192-17607972      + |
##    ENSG00000002933.8     chr7 150800403-150805120      + |
##   ENSG00000004468.12     chr4   15778275-15853230      + |
##                  ...      ...                 ...    ... .
##    ENSG00000198885.9     chr2   96325331-96330517      + |
##   ENSG00000166986.14    chr12   57475445-57517569      + |
##   ENSG00000102606.18    chr13 111114559-111305737      + |
##   ENSG00000084754.11     chr2   26190635-26244726      - |
##    ENSG00000232187.1    chr13   22696023-22696574      - |
##                                 gene_id           de_log2FC
##                             <character>           <numeric>
##   ENSG00000000971.15 ENSG00000000971.15    4.98711071929916
##    ENSG00000001561.6  ENSG00000001561.6    1.92721595380563
##   ENSG00000002549.12 ENSG00000002549.12     2.9337250105913
##    ENSG00000002933.8  ENSG00000002933.8    3.16721751137952
##   ENSG00000004468.12 ENSG00000004468.12    5.40894352968134
##                  ...                ...                 ...
##    ENSG00000198885.9  ENSG00000198885.9 0.00720665048133759
##   ENSG00000166986.14 ENSG00000166986.14   0.150797356849166
##   ENSG00000102606.18 ENSG00000102606.18   0.014028256883679
##   ENSG00000084754.11 ENSG00000084754.11   0.092957651369923
##    ENSG00000232187.1  ENSG00000232187.1  -0.243630581969118
##                                   de_padj   origin
##                                 <numeric> <factor>
##   ENSG00000000971.15 1.37057048278532e-13       de
##    ENSG00000001561.6 3.17477506295398e-05       de
##   ENSG00000002549.12 2.01310386745773e-11       de
##    ENSG00000002933.8 1.07359905748361e-08       de
##   ENSG00000004468.12 4.82904683896653e-18       de
##                  ...                  ...      ...
##    ENSG00000198885.9    0.991322026145904   not_de
##   ENSG00000166986.14    0.504960163804207   not_de
##   ENSG00000102606.18    0.959516735841647   not_de
##   ENSG00000084754.11    0.505551215678081   not_de
##    ENSG00000232187.1    0.707253683120706   not_de
##   -------
##   seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths

We expand 10kb around the TSS of the genes using mutate with width=1e4, and then perform a left join with the DA peaks. This gives us the all_genes ranges (potentially with duplication), but with the metadata columns from those overlapping DA peaks. For any gene that has no overlaps, the DA peak columns will have NA’s.

overlap_genes <- all_genes %>%
  mutate(width=1e4) %>%
  join_overlap_left(da_peaks)
overlap_genes
## GRanges object with 3318 ranges and 7 metadata columns:
##                      seqnames              ranges strand |
##                         <Rle>           <IRanges>  <Rle> |
##   ENSG00000000971.15     chr1 196651878-196661877      + |
##    ENSG00000001561.6     chr6   46129993-46139992      + |
##   ENSG00000002549.12     chr4   17577192-17587191      + |
##   ENSG00000002549.12     chr4   17577192-17587191      + |
##   ENSG00000002549.12     chr4   17577192-17587191      + |
##                  ...      ...                 ...    ... .
##   ENSG00000102606.18    chr13 111114559-111124558      + |
##   ENSG00000102606.18    chr13 111114559-111124558      + |
##   ENSG00000102606.18    chr13 111114559-111124558      + |
##   ENSG00000084754.11     chr2   26190635-26200634      - |
##    ENSG00000232187.1    chr13   22696023-22706022      - |
##                                 gene_id          de_log2FC
##                             <character>          <numeric>
##   ENSG00000000971.15 ENSG00000000971.15   4.98711071929916
##    ENSG00000001561.6  ENSG00000001561.6   1.92721595380563
##   ENSG00000002549.12 ENSG00000002549.12    2.9337250105913
##   ENSG00000002549.12 ENSG00000002549.12    2.9337250105913
##   ENSG00000002549.12 ENSG00000002549.12    2.9337250105913
##                  ...                ...                ...
##   ENSG00000102606.18 ENSG00000102606.18  0.014028256883679
##   ENSG00000102606.18 ENSG00000102606.18  0.014028256883679
##   ENSG00000102606.18 ENSG00000102606.18  0.014028256883679
##   ENSG00000084754.11 ENSG00000084754.11  0.092957651369923
##    ENSG00000232187.1  ENSG00000232187.1 -0.243630581969118
##                                   de_padj   origin          peak_id
##                                 <numeric> <factor>      <character>
##   ENSG00000000971.15 1.37057048278532e-13       de  ATAC_peak_21236
##    ENSG00000001561.6 3.17477506295398e-05       de ATAC_peak_231183
##   ENSG00000002549.12 2.01310386745773e-11       de ATAC_peak_193578
##   ENSG00000002549.12 2.01310386745773e-11       de ATAC_peak_193579
##   ENSG00000002549.12 2.01310386745773e-11       de ATAC_peak_193580
##                  ...                  ...      ...              ...
##   ENSG00000102606.18    0.959516735841647   not_de  ATAC_peak_78335
##   ENSG00000102606.18    0.959516735841647   not_de  ATAC_peak_78336
##   ENSG00000102606.18    0.959516735841647   not_de  ATAC_peak_78337
##   ENSG00000084754.11    0.505551215678081   not_de ATAC_peak_133596
##    ENSG00000232187.1    0.707253683120706   not_de             <NA>
##                               da_log2FC              da_padj
##                               <numeric>            <numeric>
##   ENSG00000000971.15 -0.546582189082724 0.000115273676444232
##    ENSG00000001561.6   1.45329684862127  9.7322474682763e-17
##   ENSG00000002549.12  0.222371496904895 3.00939005719989e-11
##   ENSG00000002549.12 -0.281615137872819 7.99888515457195e-05
##   ENSG00000002549.12  0.673705317951604 7.60042918890061e-15
##                  ...                ...                  ...
##   ENSG00000102606.18 -0.540341783905459 3.95332799461161e-13
##   ENSG00000102606.18 -0.986233061339336 5.76598321668095e-17
##   ENSG00000102606.18 -0.687331485649972 3.75825714506234e-08
##   ENSG00000084754.11   -1.8694606077659 2.85814377477119e-31
##    ENSG00000232187.1               <NA>                 <NA>
##   -------
##   seqinfo: 26 sequences (1 circular) from an unspecified genome; no seqlengths

Now we can ask, how many DA peaks are near DE genes relative to “other” genes? Note that a gene may appear more than once, because we performed a left join.

We group by DE vs “other”, and sum up the DA peaks at any LFC, or requiring an LFC of greater than 1, or greater than 2.

overlap_tab <- overlap_genes %>%
  group_by(origin) %>%
  summarize(any=sum(!is.na(da_padj)),
            lfc1=sum(abs(da_log2FC) > 1, na.rm=TRUE),
            lfc2=sum(abs(da_log2FC) > 2, na.rm=TRUE))
overlap_tab
## DataFrame with 2 rows and 4 columns
##     origin       any      lfc1      lfc2
##   <factor> <integer> <integer> <integer>
## 1   not_de      1198       207        14
## 2       de      1833       605       131

We can see that the enrichment increases for larger LFC threshold:

sapply(overlap_tab[,-1], function(x) x[2]/x[1])
##      any     lfc1     lfc2 
## 1.530050 2.922705 9.357143

We can ask for the top peak per gene based on the peak’s LFC, and pipe that directly into a boxplot using ggplot2:

library(ggplot2)
overlap_genes %>%
  group_by(gene_id) %>%
  plyranges::filter(abs(da_log2FC) == max(abs(da_log2FC))) %>%
  mcols %>% as.data.frame %>%
  ggplot(aes(origin, abs(da_log2FC))) +
  geom_boxplot()

Further questions

  • We specified a fixed-size window for computing overlaps. How could we have looked at various settings of this distance parameter?
  • We computed the sum of DA peaks near the DE genes, for increasing LFC thresholds on the accessibility change. As we increased the threshold, the number of total peaks went down (likewise the mean number of DA peaks per gene). Likely the number of DE genes with a DA peak nearby with such a large change went down, but we didn’t keep track of this. How could we have computed this?
  • We have some duplicate code in the summarize call, where we change only the LFC threshold. How could we perform this same operation over, say, a grid of 100 thresholds, without having to write the redundant code?
  • How would you adapt the code to also consider variations on the thresholds applied to the DE genes (FDR 1%, testing against a null region of |LFC| < 1), and to the FDR cutoff (1%) for the DA peaks?
  • We relied on the fact that the reference transcriptome was known to tximeta. What if this had not been the case?

References

Alasoo, K, J Rodrigues, S Mukhopadhyay, AJ Knights, AL Mann, K Kundu, HIPSCI-Consortium, C Hale, Dougan G, and DJ Gaffney. 2018. “Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response.” Nature Genetics 50: 424–31. https://doi.org/10.1038/s41588-018-0046-7.

Frankish, A, GENCODE-consoritum, and P Flicek. 2018. “GENCODE reference annotation for the human and mouse genomes.” Nucleic Acids Research. https://doi.org/10.1093/nar/gky955.

Hansen, KD, RA Irizarry, and Wu Z. 2012. “Removing technical variability in RNA-seq data using condition quantile normalization.” Biostatistics 13 (2): 204–16. https://doi.org/10.1093/biostatistics/kxr054.

Köster, Johannes, and Sven Rahmann. 2012. “Snakemake—a scalable bioinformatics workflow engine.” Bioinformatics 28 (19): 2520–2. https://doi.org/10.1093/bioinformatics/bts480.

Lawrence, Wolfgang AND Pagès, Michael AND Huber. 2013. “Software for Computing and Annotating Genomic Ranges.” PLOS Computational Biology 9 (8). Public Library of Science: 1–10. https://doi.org/10.1371/journal.pcbi.1003118.

Lee, Stuart, Dianne Cook, and Michael Lawrence. 2019. “Plyranges: A Grammar of Genomic Data Transformation.” Genome Biology 20 (1): 4. https://doi.org/10.1186/s13059-018-1597-8.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Patro, R, G Duggal, MI Love, RA Irizarry, and C Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14: 417–19. https://doi.org/10.1038/nmeth.4197.

Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1).

Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (1521). https://doi.org/10.12688/f1000research.7563.1.