A plyranges and tximeta case study: Overlapping differential expression and differential chromatin accessibility
Instructor(s) name(s) and contact information
Introduction
In this case study, we will examine a subset of the RNA-seq and ATAC-seq data from Alasoo et al. (2018) - doi: 10.1038/s41588-018-0046-7. The experiment involved treatment of macrophage cell lines from a number of human donors with IFN gamma, Salmonella infection, or both treatments combined. In the original study, the authors examined gene expression and chromatin accessibility in a subset of 86 successfully differentiated iPSC lines, and examined baseline QTL and response QTL for both expression and accessibility. Here we will not examine the genotypes, but only the expression data and chromatin accessibility data, which is publicly available. First, let us review the main finding as presented in the abstract:
We observed that ~60% of stimulus-specific expression QTLs with a detectable effect on chromatin altered the chromatin accessibility in naive cells, thus suggesting that they perturb enhancer priming. Such variants probably influence binding of cell-type-specific transcription factors, such as PU.1, which can then indirectly alter the binding of stimulus-specific transcription factors, such as NF-κB or STAT2. Thus, although chromatin accessibility assays are powerful for fine-mapping causal regulatory variants, detecting their downstream effects on gene expression will be challenging, requiring profiling of large numbers of stimulated cellular states and time points.
Scientific aim of the case study: here we will perform a much simpler analysis. We will examine the effect of IFNg stimulation on gene expression and chromatin accessibility, and look to see if there is an enrichment of differentially accessible (DA) ATAC-seq peaks in the vicinity of differentially expressed (DE) genes. This is plausible, as the transcriptomic response to IFNg stimulation may be mediated through binding of regulatory proteins to accessible regions, and this binding may increase the accessibility of those regions such that it can be detected by ATAC-seq.
Technical aim of the case study: we will show how to use the Bioconductor package plyranges and tximeta. The first package can be used to perform easily-readable manipulations of data tied to genomic ranges, e.g. shifts, windows, overlaps, etc. The plyranges package is described by Lee, Cook, and Lawrence (2019), and leverages underlying range operations described by Lawrence (2013). The second package is used to read RNA-seq quantification into R/Bioconductor, such that the genomic ranges are automatically attached to the quantification data and differential expression results.
We begin by loading the RNA-seq data contained in the Bioconductor
Experiment Data package, macrophage. The package contains RNA-seq
quantification from 24 RNA-seq samples, a subset of the
RNA-seq samples generated and analyzed by Alasoo et al. (2018). The paired-end
reads were quantified using Salmon (Patro et al. 2017), using the Gencode 29
human reference transcripts (Frankish, GENCODE-consoritum, and Flicek 2018). For more details on
quantification, and the exact code used, consult the vignette of the
macrophage package.
The package also contains the Snakemake
file that was used to
distribute the Salmon quantification jobs on a cluster
(Köster and Rahmann 2012).
RNA-seq data import with tximeta
First, we specify a directory dir
, where the quantification files
are stored. You typically would not use system.file
to specify
this directory. You could simply specify this directory with:
dir <- "/path/to/quant/files"
However, here we have distributed the files in an R package, so we
locate that R package and the files using system.file
.
The rest of the chunk reads in the coldata.csv
file, selects
relevant columns from this table, and created files
and
condition
columns which point to the quantification files and
specify the condition (as a factor). One last detail is that the
quantification files have been gzipped, so they have the .gz
ending. Typically, these would look like quant.sf
without the .gz
ending.
suppressPackageStartupMessages(library(plyranges))
library(macrophage)
dir <- system.file("extdata", package="macrophage")
coldata <- file.path(dir, "coldata.csv") %>%
read.csv() %>%
dplyr::select(
names,
id = sample_id,
line = line_id,
condition = condition_name
) %>%
mutate(
files = file.path(dir, "quants", names, "quant.sf.gz"),
condition = relevel(condition, "naive")
)
head(coldata)
## names id line condition
## 1 SAMEA103885102 diku_A diku_1 naive
## 2 SAMEA103885347 diku_B diku_1 IFNg
## 3 SAMEA103885043 diku_C diku_1 SL1344
## 4 SAMEA103885392 diku_D diku_1 IFNg_SL1344
## 5 SAMEA103885182 eiwy_A eiwy_1 naive
## 6 SAMEA103885136 eiwy_B eiwy_1 IFNg
## files
## 1 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885102/quant.sf.gz
## 2 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885347/quant.sf.gz
## 3 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885043/quant.sf.gz
## 4 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885392/quant.sf.gz
## 5 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885182/quant.sf.gz
## 6 /usr/local/lib/R/site-library/macrophage/extdata/quants/SAMEA103885136/quant.sf.gz
We next load the SummarizedExperiment and tximeta packages.
suppressPackageStartupMessages(library(SummarizedExperiment))
library(tximeta)
The following lines of code do a lot of work for the user: importing
the RNA-seq quantification (dropping inferential replicates in this
case), locating the relevant reference transcriptome, attaching
the transcript ranges to the data, and fetching genome information. We
can see what data has been imported using assayNames
, which lists
the assays of the SummarizedExperiment.
se <- tximeta(coldata, dropInfReps=TRUE)
## importing quantifications
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## found matching transcriptome:
## [ Gencode - Homo sapiens - release 29 ]
## building TxDb with 'GenomicFeatures' package
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ...
## Warning in .get_cds_IDX(type, phase): The "phase" metadata column contains non-NA values for features of
## type stop_codon. This information was ignored.
## OK
## generating transcript ranges
## fetching genome info
assayNames(se)
## [1] "counts" "abundance" "length"
Because tximeta
knows the correct reference transcriptome, we can
ask tximeta
to summarize the transcript-level data to the gene level
using the methods of Soneson, Love, and Robinson (2015).
gse <- summarizeToGene(se)
## loading existing TxDb created: 2019-06-22 01:50:06
## Loading required package: GenomicFeatures
## Loading required package: AnnotationDbi
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:plyranges':
##
## select
## obtaining transcript-to-gene mapping from TxDb
## generating gene ranges
## summarizing abundance
## summarizing counts
## summarizing length
Basic RNA-seq DE analysis
We can easily run a differential expression analysis with DESeq2
using the following code chunks (Love, Huber, and Anders 2014). The design indicates that
we want to control for the donor (line
) and test on the condition.
library(DESeq2)
dds <- DESeqDataSet(gse, ~line + condition)
## using counts and average transcript lengths from tximeta
keep <- rowSums(counts(dds) >= 10) >= 6
dds <- dds[keep,]
dds <- DESeq(dds)
## estimating size factors
## using 'avgTxLength' from assays(dds), correcting for library size
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds, contrast=c("condition","IFNg","naive"),
lfcThreshold=1, alpha=0.01)
Always good to do a little visualization of results:
summary(res)
## [1] "DESeqResults object of length 6 with 2 metadata columns"
DESeq2::plotMA(res, ylim=c(-10,10))
plotCounts(dds, which.min(res$pvalue), "condition")
We now output GRanges results (both DE and some non-DE for
comparison later). The select
command pulls out particular columns
from the GRanges, in some cases renaming them so they will be unique
when we look at differential accessibility (which also will have LFC
and adjusted p-values).
de_genes <- results(dds,
contrast=c("condition","IFNg","naive"),
lfcThreshold=1,
format="GRanges") %>%
filter(padj < 0.01) %>%
mutate(gene_id=names(.)) %>%
plyranges::select(gene_id,
de_log2FC = log2FoldChange,
de_padj = padj)
We re-run results
because we don’t want to use an lfcThreshold
this time, to pull out genes which are not differentially expressed
according to the DESeq2 significance test.
other_genes <- results(dds,
contrast=c("condition","IFNg","naive"),
format="GRanges") %>%
filter(pvalue > 0.1) %>%
mutate(gene_id=names(.)) %>%
plyranges::select(gene_id,
de_log2FC = log2FoldChange,
de_padj = padj)
ATAC-seq peak DA analysis
The ATAC-seq data can be downloaded from the following deposition URL:
https://zenodo.org/record/1188300#.XIAhXlNKjOQ
The data is fairly large. For running this section of workflow we need the files:
ATAC_cqn_matrix.txt.gz
(109M)ATAC_sample_metadata.txt.gz
(<1M)ATAC_peak_metadata.txt.gz
(5.6M)
Due to the large size of the matrix file, we cannot bundle the data
with this workflow repository. We therefore set the following code
chunks to be unevaluated, until the creation of the peaks
object.
The ATAC-seq data has already been normalized with cqn (Hansen, Irizarry, and Z 2012) and log2 transformed. Loading the cqn-normalized matrix of log2 transformed read counts takes ~30 seconds and loads an object of ~370 Mb.
atac_mat <- as.matrix(read.delim("ATAC_cqn_matrix.txt.gz"))
We also read in the sample metadata:
atac_coldata <- read.delim("ATAC_sample_metadata.txt.gz") %>%
plyranges::select(
sample_id,
donor,
condition = condition_name
) %>%
mutate(condition = relevel(condition, "naive"))
Finally, we read in the peak metadata (locations in the genome), and convert to a GRanges object:
peaks_df <- read.delim("ATAC_peak_metadata.txt.gz", strings=FALSE)
peaks_gr <- peaks_df %>%
plyranges::select(seqnames = chr, start, end, peak_id=gene_id) %>%
as_granges()
# we know this from the Zenodo entry
# https://zenodo.org/record/1188300#.XJOFSlNKiL5
genome(peaks_gr) <- "GRCh38"
Now we make sure that our ATAC-seq quantifications and our metadata are aligned:
idx <- match(colnames(atac_mat), atac_coldata$sample_id)
atac_coldata <- atac_coldata[idx,]
all.equal(colnames(atac_mat), as.character(atac_coldata$sample_id))
We then combine the data, and two pieces of metadata into a SummarizedExperiment:
atac <- SummarizedExperiment(list(cqndata=atac_mat),
rowRanges=peaks_gr,
colData=atac_coldata)
We can check the SD over mean plot, to assess for any systematic trends:
rmu <- rowMeans(assay(atac))
rvar <- rowVars(assay(atac))
idx <- sample(nrow(atac),1e4)
plot(rmu[idx], sqrt(rvar[idx]), cex=.1)
For assessing differential accessibility, we run limma (Smyth 2004):
library(limma)
design <- model.matrix(~donor + condition, colData(atac))
fit <- lmFit(assay(atac), design)
fit <- eBayes(fit)
idx <- which(colnames(fit$coefficients) == "conditionIFNg")
tt <- topTable(fit, coef=idx, sort.by="none", n=nrow(atac))
We can plot the top gene by LFC:
idx <- which.max(tt$logFC)
plot(assay(atac)[idx,] ~ atac$condition)
table(tt$logFC > 1 & tt$adj.P.Val < .05)
We now take the rowRanges
of the SummarizedExperiment and attach
the LFC and adjusted p-value from limma, so that we can consider the
overlap with differential expression.
peaks <- rowRanges(atac) %>%
mutate(
da_log2FC = tt$logFC,
da_padj = tt$adj.P.Val
)
seqlevelsStyle(peaks) <- "UCSC"
genome(peaks) <- "hg38"
Due to restrictions on the size of this workflow repository, we do not
evaluate the above chunks, but instead now load the peaks
object
that was saved after following the code above.
library(plyrangesTximetaCaseStudy)
## Warning: replacing previous import 'DESeq2::plotMA' by 'limma::plotMA' when
## loading 'plyrangesTximetaCaseStudy'
data(atac_peaks)
peaks
## GRanges object with 296220 ranges and 3 metadata columns:
## seqnames ranges strand | peak_id
## <Rle> <IRanges> <Rle> | <character>
## ATAC_peak_1 chr1 9979-10668 * | ATAC_peak_1
## ATAC_peak_2 chr1 10939-11473 * | ATAC_peak_2
## ATAC_peak_3 chr1 15505-15729 * | ATAC_peak_3
## ATAC_peak_4 chr1 21148-21481 * | ATAC_peak_4
## ATAC_peak_5 chr1 21864-22067 * | ATAC_peak_5
## ... ... ... ... . ...
## ATAC_peak_296216 chrX 155896572-155896835 * | ATAC_peak_296216
## ATAC_peak_296217 chrX 155958507-155958646 * | ATAC_peak_296217
## ATAC_peak_296218 chrX 156016760-156016975 * | ATAC_peak_296218
## ATAC_peak_296219 chrX 156028551-156029422 * | ATAC_peak_296219
## ATAC_peak_296220 chrX 156030135-156030785 * | ATAC_peak_296220
## da_log2FC da_padj
## <numeric> <numeric>
## ATAC_peak_1 0.266185396736073 9.10672732956434e-05
## ATAC_peak_2 0.32217712436691 2.03434717570469e-05
## ATAC_peak_3 -0.574159538548115 3.41707743345703e-08
## ATAC_peak_4 -1.14706617895329 8.22298606986521e-26
## ATAC_peak_5 -0.896143162633654 4.79452571676397e-11
## ... ... ...
## ATAC_peak_296216 -0.834628897017445 1.3354605397165e-11
## ATAC_peak_296217 -0.147537281935847 0.313014754316915
## ATAC_peak_296218 -0.609732301631964 3.62338775135558e-09
## ATAC_peak_296219 -0.347678474957794 6.94823191242968e-06
## ATAC_peak_296220 0.492442459200901 7.07663984067763e-13
## -------
## seqinfo: 23 sequences from hg38 genome; no seqlengths
We filter to the set of peaks with a nominal FDR bound of 1%.
da_peaks <- peaks %>% filter(da_padj < .01)
Overlap analysis with plyranges
Now that we have DE genes, genes without strong signal of DE, and DA
peaks, we can perform our original aim to assess the enrichment of DA
peaks near DE genes. We’ve already used plyranges a number of times
above, to filter
, mutate
and select
on GRanges objects, but
now we will get into more complicated use cases.
We define a function to sample n
ranges from a GRanges object x
.
# sub-sample, here define a dplyr function first
sample_n <- function(x,n) x[sample.int(length(x),n,replace=FALSE)]
We downsample the number of “other” genes to the number of DE genes:
other_genes <- other_genes %>%
sample_n(length(de_genes))
We then bind these two sets of ranges together and add a new column
origin
. From ?bind_ranges
:
When
.id
is supplied a new column is created that links each row to the original Range object.
lvls <- c("not_de","de")
all_genes <- bind_ranges(de=de_genes,
not_de=other_genes,
.id="origin") %>%
mutate(
origin = factor(origin, lvls)
)
all_genes
## GRanges object with 1498 ranges and 4 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSG00000000971.15 chr1 196651878-196747504 + |
## ENSG00000001561.6 chr6 46129993-46146699 + |
## ENSG00000002549.12 chr4 17577192-17607972 + |
## ENSG00000002933.8 chr7 150800403-150805120 + |
## ENSG00000004468.12 chr4 15778275-15853230 + |
## ... ... ... ... .
## ENSG00000198885.9 chr2 96325331-96330517 + |
## ENSG00000166986.14 chr12 57475445-57517569 + |
## ENSG00000102606.18 chr13 111114559-111305737 + |
## ENSG00000084754.11 chr2 26190635-26244726 - |
## ENSG00000232187.1 chr13 22696023-22696574 - |
## gene_id de_log2FC
## <character> <numeric>
## ENSG00000000971.15 ENSG00000000971.15 4.98711071929916
## ENSG00000001561.6 ENSG00000001561.6 1.92721595380563
## ENSG00000002549.12 ENSG00000002549.12 2.9337250105913
## ENSG00000002933.8 ENSG00000002933.8 3.16721751137952
## ENSG00000004468.12 ENSG00000004468.12 5.40894352968134
## ... ... ...
## ENSG00000198885.9 ENSG00000198885.9 0.00720665048133759
## ENSG00000166986.14 ENSG00000166986.14 0.150797356849166
## ENSG00000102606.18 ENSG00000102606.18 0.014028256883679
## ENSG00000084754.11 ENSG00000084754.11 0.092957651369923
## ENSG00000232187.1 ENSG00000232187.1 -0.243630581969118
## de_padj origin
## <numeric> <factor>
## ENSG00000000971.15 1.37057048278532e-13 de
## ENSG00000001561.6 3.17477506295398e-05 de
## ENSG00000002549.12 2.01310386745773e-11 de
## ENSG00000002933.8 1.07359905748361e-08 de
## ENSG00000004468.12 4.82904683896653e-18 de
## ... ... ...
## ENSG00000198885.9 0.991322026145904 not_de
## ENSG00000166986.14 0.504960163804207 not_de
## ENSG00000102606.18 0.959516735841647 not_de
## ENSG00000084754.11 0.505551215678081 not_de
## ENSG00000232187.1 0.707253683120706 not_de
## -------
## seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
We expand 10kb around the TSS of the genes using mutate
with
width=1e4
, and then perform a left join with the DA peaks. This
gives us the all_genes
ranges (potentially with duplication), but
with the metadata columns from those overlapping DA peaks. For any
gene that has no overlaps, the DA peak columns will have NA
’s.
overlap_genes <- all_genes %>%
mutate(width=1e4) %>%
join_overlap_left(da_peaks)
overlap_genes
## GRanges object with 3318 ranges and 7 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## ENSG00000000971.15 chr1 196651878-196661877 + |
## ENSG00000001561.6 chr6 46129993-46139992 + |
## ENSG00000002549.12 chr4 17577192-17587191 + |
## ENSG00000002549.12 chr4 17577192-17587191 + |
## ENSG00000002549.12 chr4 17577192-17587191 + |
## ... ... ... ... .
## ENSG00000102606.18 chr13 111114559-111124558 + |
## ENSG00000102606.18 chr13 111114559-111124558 + |
## ENSG00000102606.18 chr13 111114559-111124558 + |
## ENSG00000084754.11 chr2 26190635-26200634 - |
## ENSG00000232187.1 chr13 22696023-22706022 - |
## gene_id de_log2FC
## <character> <numeric>
## ENSG00000000971.15 ENSG00000000971.15 4.98711071929916
## ENSG00000001561.6 ENSG00000001561.6 1.92721595380563
## ENSG00000002549.12 ENSG00000002549.12 2.9337250105913
## ENSG00000002549.12 ENSG00000002549.12 2.9337250105913
## ENSG00000002549.12 ENSG00000002549.12 2.9337250105913
## ... ... ...
## ENSG00000102606.18 ENSG00000102606.18 0.014028256883679
## ENSG00000102606.18 ENSG00000102606.18 0.014028256883679
## ENSG00000102606.18 ENSG00000102606.18 0.014028256883679
## ENSG00000084754.11 ENSG00000084754.11 0.092957651369923
## ENSG00000232187.1 ENSG00000232187.1 -0.243630581969118
## de_padj origin peak_id
## <numeric> <factor> <character>
## ENSG00000000971.15 1.37057048278532e-13 de ATAC_peak_21236
## ENSG00000001561.6 3.17477506295398e-05 de ATAC_peak_231183
## ENSG00000002549.12 2.01310386745773e-11 de ATAC_peak_193578
## ENSG00000002549.12 2.01310386745773e-11 de ATAC_peak_193579
## ENSG00000002549.12 2.01310386745773e-11 de ATAC_peak_193580
## ... ... ... ...
## ENSG00000102606.18 0.959516735841647 not_de ATAC_peak_78335
## ENSG00000102606.18 0.959516735841647 not_de ATAC_peak_78336
## ENSG00000102606.18 0.959516735841647 not_de ATAC_peak_78337
## ENSG00000084754.11 0.505551215678081 not_de ATAC_peak_133596
## ENSG00000232187.1 0.707253683120706 not_de <NA>
## da_log2FC da_padj
## <numeric> <numeric>
## ENSG00000000971.15 -0.546582189082724 0.000115273676444232
## ENSG00000001561.6 1.45329684862127 9.7322474682763e-17
## ENSG00000002549.12 0.222371496904895 3.00939005719989e-11
## ENSG00000002549.12 -0.281615137872819 7.99888515457195e-05
## ENSG00000002549.12 0.673705317951604 7.60042918890061e-15
## ... ... ...
## ENSG00000102606.18 -0.540341783905459 3.95332799461161e-13
## ENSG00000102606.18 -0.986233061339336 5.76598321668095e-17
## ENSG00000102606.18 -0.687331485649972 3.75825714506234e-08
## ENSG00000084754.11 -1.8694606077659 2.85814377477119e-31
## ENSG00000232187.1 <NA> <NA>
## -------
## seqinfo: 26 sequences (1 circular) from an unspecified genome; no seqlengths
Now we can ask, how many DA peaks are near DE genes relative to “other” genes? Note that a gene may appear more than once, because we performed a left join.
We group by DE vs “other”, and sum up the DA peaks at any LFC, or requiring an LFC of greater than 1, or greater than 2.
overlap_tab <- overlap_genes %>%
group_by(origin) %>%
summarize(any=sum(!is.na(da_padj)),
lfc1=sum(abs(da_log2FC) > 1, na.rm=TRUE),
lfc2=sum(abs(da_log2FC) > 2, na.rm=TRUE))
overlap_tab
## DataFrame with 2 rows and 4 columns
## origin any lfc1 lfc2
## <factor> <integer> <integer> <integer>
## 1 not_de 1198 207 14
## 2 de 1833 605 131
We can see that the enrichment increases for larger LFC threshold:
sapply(overlap_tab[,-1], function(x) x[2]/x[1])
## any lfc1 lfc2
## 1.530050 2.922705 9.357143
We can ask for the top peak per gene based on the peak’s LFC, and pipe that directly into a boxplot using ggplot2:
library(ggplot2)
overlap_genes %>%
group_by(gene_id) %>%
plyranges::filter(abs(da_log2FC) == max(abs(da_log2FC))) %>%
mcols %>% as.data.frame %>%
ggplot(aes(origin, abs(da_log2FC))) +
geom_boxplot()
Further questions
- We specified a fixed-size window for computing overlaps. How could we have looked at various settings of this distance parameter?
- We computed the sum of DA peaks near the DE genes, for increasing LFC thresholds on the accessibility change. As we increased the threshold, the number of total peaks went down (likewise the mean number of DA peaks per gene). Likely the number of DE genes with a DA peak nearby with such a large change went down, but we didn’t keep track of this. How could we have computed this?
- We have some duplicate code in the
summarize
call, where we change only the LFC threshold. How could we perform this same operation over, say, a grid of 100 thresholds, without having to write the redundant code? - How would you adapt the code to also consider variations on the thresholds applied to the DE genes (FDR 1%, testing against a null region of |LFC| < 1), and to the FDR cutoff (1%) for the DA peaks?
- We relied on the fact that the reference transcriptome was known to
tximeta
. What if this had not been the case?
References
Alasoo, K, J Rodrigues, S Mukhopadhyay, AJ Knights, AL Mann, K Kundu, HIPSCI-Consortium, C Hale, Dougan G, and DJ Gaffney. 2018. “Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response.” Nature Genetics 50: 424–31. https://doi.org/10.1038/s41588-018-0046-7.
Frankish, A, GENCODE-consoritum, and P Flicek. 2018. “GENCODE reference annotation for the human and mouse genomes.” Nucleic Acids Research. https://doi.org/10.1093/nar/gky955.
Hansen, KD, RA Irizarry, and Wu Z. 2012. “Removing technical variability in RNA-seq data using condition quantile normalization.” Biostatistics 13 (2): 204–16. https://doi.org/10.1093/biostatistics/kxr054.
Köster, Johannes, and Sven Rahmann. 2012. “Snakemake—a scalable bioinformatics workflow engine.” Bioinformatics 28 (19): 2520–2. https://doi.org/10.1093/bioinformatics/bts480.
Lawrence, Wolfgang AND Pagès, Michael AND Huber. 2013. “Software for Computing and Annotating Genomic Ranges.” PLOS Computational Biology 9 (8). Public Library of Science: 1–10. https://doi.org/10.1371/journal.pcbi.1003118.
Lee, Stuart, Dianne Cook, and Michael Lawrence. 2019. “Plyranges: A Grammar of Genomic Data Transformation.” Genome Biology 20 (1): 4. https://doi.org/10.1186/s13059-018-1597-8.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.
Patro, R, G Duggal, MI Love, RA Irizarry, and C Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nature Methods 14: 417–19. https://doi.org/10.1038/nmeth.4197.
Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1).
Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (1521). https://doi.org/10.12688/f1000research.7563.1.