R is a very popular open-source statistical programming language, with lots of interesting features and challenging quirks. R is used in many aspects of data analysis, by people ranging from students in small academic groups to professional engineers in the largest social media companies. This tutorial takes us from the basics of R to advanced features that are particularly interesting to scientists and engineers. We’ll talk about ‘atomic’ vectors, functional and vectorized computations, R’s unique class systems, visualization, extending R to process large data, and literate programming. A lot of material for a couple of hours, but it’ll be fun.
Today
R
Bioconductor
Basics of R
‘Atomic’ vectors (raw, logical, character, integer, numeric, complex, list)
1 + 2
## [1] 3
rnorm(10)
## [1] -0.44434257 2.25297871 -1.91564662 -0.01458099 0.37155174
## [6] -0.60815538 1.81692434 0.15781449 -1.23618066 -1.74224901
x <- rnorm(1000)
y <- x + rnorm(1000, sd=.5)
plot(y ~ x)
More complicated data structures
## data.frame: rectangular, homogeneous within but not between columns
df <- data.frame(X=x, Y=y)
head(df)
## X Y
## 1 1.81963041 1.6082495
## 2 -0.90109341 -1.2308712
## 3 0.80206844 1.0635343
## 4 0.04485260 0.1397213
## 5 0.65105337 0.2969279
## 6 -0.07158132 1.0479738
## 'S3' class; complicated structure managed via API
fit <- lm(Y ~ X, df)
anova(fit)
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## X 1 1006.78 1006.78 3930.7 < 2.2e-16 ***
## Residuals 998 255.62 0.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(Y ~ X, df, asp=1)
abline(fit, lwd=2, col="cyan")
Statistical concepts, e.g.,
NA: missing valuesfactor(): vector of finite levels, e.g., ‘Male’, ‘Female’, ‘Hermaphrodite’~: FormulaPackages: Core and user contributed; usually domain-specific rather than ‘infrastructure’
library(ggplot2) # part of the 'Hadleyverse'
ggplot(df, aes(x=X, y=Y)) + geom_point() + geom_smooth()
## Warning in grid.Call.graphics(L_polygon, x$x, x$y, index): semi-
## transparency is not supported on this device: reported only once per page
Help
?rnorm: man pages with technical detailsSimple and advanced statistical analysis intrinsic to R: descriptive statistics, simple and advanced linear models, machine learning, …
stats::kmeans(): a basic implementation, in a core packagekernlab::kkmeans(): a kernel-based implementation, in a contributed package – check out the vignette, vignette("kernlab")R-flavored markdown, .Rmd files
R code chunks
```{r}
x <- rnorm(100)
```
S3
Informal classes – a list() with a class attribute declaring the list to be a member of a class hierarchy. Single inheritance.
me = structure(list(first="Martin", last="Morgan"),
class="people")‘Generic’ and associated ‘methods’
print # generic
## function (x, ...)
## UseMethod("print")
## <bytecode: 0xfef9318>
## <environment: namespace:base>
print.people = function(x, ...)
cat("who:\n ", paste(x$first, x$last, "\n "))
me
## who:
## Martin Morgan
## R works on vectors, so model columns rather than rows!
team = function(name, first, last)
structure(list(name=name, first=first, last=last),
class=c("team", "people"))
print.team = function(x) {
cat("team name:", x$name,
"\nteam size:", length(x$first),
"\n")
NextMethod()
}
bioc = team("Bioconductor Core Team",
c("Martin", "Valerie", "Dan", "Herve"),
c("Morgan", "Obenchain", "Tenenbaum", "Pages"))
bioc
## team name: Bioconductor Core Team
## team size: 4
## who:
## Martin Morgan
## Valerie Obenchain
## Dan Tenenbaum
## Herve Pages
## S4
Much more complicated data structures possible
.People = setClass("People",
slots=c(first="character", last="character"),
validity=function(object) {
tst = character()
if (length(object@first) != length(object@last))
tst = c(tst, "'first' and 'last' must have same length")
## other validity checks...
if (length(tst)) tst else NULL
})
setMethod("show", "People", function(object) {
cat("who:\n ", paste(object@first, object@last, "\n "))
})
## [1] "show"
.Team = setClass("Team",
contains="People",
slots=c(name="character"))
Team = function(name, people) {
## constructor, then generate class
members = strsplit(people, "[[:space:]]+")
.Team(name=name,
first=sapply(members, `[[`, 1),
last=sapply(members, `[[`, 2))
}
setMethod("show", "Team", function(object) {
cat("team name:", object@name,
"\nteam size:", length(object@first),
"\n")
callNextMethod()
})
## [1] "show"
Team("Bioconductor Core Team",
c("Martin Morgan", "Valerie Obenchain", "Dan Tenenbaum",
"Herve Pages"))
## team name: Bioconductor Core Team
## team size: 4
## who:
## Martin Morgan
## Valerie Obenchain
## Dan Tenenbaum
## Herve Pages
## Advanced: what about R itself?
All accessible to the advanced R developer
R.home("include/Rinternals.h")
## [1] "/home/mtmorgan/bin/R-3-2-branch/include/Rinternals.h"Vector types are S-expressions that wrap metadata around base types
Different types
#define NILSXP 0 /* nil = NULL */
#define SYMSXP 1 /* symbols */
...
#define CLOSXP 3 /* closures */
...
#define LGLSXP 10 /* logical vectors */
#define INTSXP 13 /* integer vectors */
#define REALSXP 14 /* real variables */
#define CPLXSXP 15 /* complex variables */
...Basic data structure
struct sxpinfo_struct {
SEXPTYPE type : TYPE_BITS;
...
unsigned int named : 2; /* reference counter - 0, 1, >1
...
unsigned int gcgen : 1; /* old generation number */
...
}; /* Tot: 32 */
struct vecsxp_struct {
R_len_t length;
R_len_t truelength;
};
typedef struct VECTOR_SEXPREC {
SEXPREC_HEADER;
struct vecsxp_struct vecsxp;
} VECTOR_SEXPREC, *VECSEXP;Efficient R code
R is fast…
x <- runif(1000000)
system.time(log(x))
## user system elapsed
## 0.046 0.000 0.046…but it’s easy to fall off the wagon
library(microbenchmark)
f = function(n) {
y <- numeric()
for (i in seq_len(n))
y[i] = log(x[i])
y
}
microbenchmark(f(1000), f(10000), f(20000), times=10)
## Unit: milliseconds
## expr min lq mean median uq
## f(1000) 2.639494 2.716821 3.310221 2.771704 2.972985
## f(10000) 151.727198 152.250440 227.279112 153.954549 158.203007
## f(20000) 577.658969 584.645653 693.815064 588.491838 945.794270
## max neval cld
## 5.559355 10 a
## 523.111300 10 b
## 948.209794 10 cAlgorithms
for loops test all combinationsfindOverlaps()Many R programs have map-reduce-like semantics
FUN <- function(i) { Sys.sleep(1); i }
lapply(1:5, FUN)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
system.time(lapply(1:5, FUN))
## user system elapsed
## 0.000 0.000 5.005Very easy to parallelize
library(parallel)
system.time({
mclapply(1:5, FUN, mc.cores=5)
})
## user system elapsed
## 0.009 0.044 1.025Clusters & clouds
R
z <- rnorm(1e6)
top_n <- function(x, n)
tail(order(x), n)
system.time(top_n(z, 100))
## user system elapsed
## 0.79 0.00 0.79Rcpp: priority queue (file “top_i_pq.cpp”)
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
IntegerVector top_i_pq(NumericVector v, int n)
{
typedef pair<double, int> Elt;
priority_queue< Elt, vector<Elt>, greater<Elt> > pq;
vector<int> result;
for (int i = 0; i != v.size(); ++i) {
if (pq.size() < n)
pq.push(Elt(v[i], i));
else {
Elt elt = Elt(v[i], i);
if (pq.top() < elt) {
pq.pop();
pq.push(elt);
}
}
}
result.reserve(pq.size());
while (!pq.empty()) {
result.push_back(pq.top().second + 1);
pq.pop();
}
return wrap(result);
}Compile (sourceCpp() is a party trick; instead build library and load in R)
sourceCpp("top_i_pq.cpp")
top_i_pq
## function (v, n)
## .Primitive(".Call")(<pointer: 0x7f501f3e13b0>, v, n)Run
z <- rnorm(1e6)
res1 <- top_n(z, 100)
res2 <- top_i_pq(z, 100)
identical(res1, res2)
## [1] TRUE
microbenchmark(top_n(z, 100), top_i_pq(z, 100), times=10)
## Unit: milliseconds
## expr min lq mean median uq
## top_n(z, 100) 683.298091 698.011092 732.242213 714.348213 772.386669
## top_i_pq(z, 100) 4.712946 5.012883 5.140092 5.083448 5.245147
## max neval cld
## 823.853972 10 b
## 5.593705 10 aBackground
Upstream
Reduction
Coordinates counts with row and column annotations, e.g., genomic ranges of each gene
## Coordinates of each known gene
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
## experiment data (subset) -- aligned reads; 1 file per indivdiual
library(RNAseqData.HNRNPC.bam.chr14)
library(Rsamtools)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bfl <- BamFileList(fls, yieldSize=1000000)
## summarize reads overlapping each gene and individual
library(GenomicAlignments)
se <- summarizeOverlaps(genes, bfl)
head(assay(se)[rowSums(assay(se)) > 0,])
## ERR127306 ERR127307 ERR127308 ERR127309 ERR127302 ERR127303
## 10001 114 156 129 144 175 213
## 100128927 287 315 300 243 190 160
## 100129075 36 38 35 36 20 32
## 100129794 21 34 33 23 9 16
## 100288846 10 10 11 8 2 3
## 100289511 150 125 174 114 34 23
## ERR127304 ERR127305
## 10001 210 165
## 100128927 194 218
## 100129075 45 45
## 100129794 23 13
## 100288846 5 1
## 100289511 31 61Analysis
A more complete sample data set
library(airway)
data(airway)
## the count matrix
head(assay(airway))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003 679 448 873 408 1138
## ENSG00000000005 0 0 0 0 0
## ENSG00000000419 467 515 621 365 587
## ENSG00000000457 260 211 263 164 245
## ENSG00000000460 60 55 40 35 78
## ENSG00000000938 0 0 2 0 1
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003 1047 770 572
## ENSG00000000005 0 0 0
## ENSG00000000419 799 417 508
## ENSG00000000457 331 233 229
## ENSG00000000460 63 76 60
## ENSG00000000938 0 0 0
## description of the experiment (columns of the count matrix)
colData(airway)[, 1:3]
## DataFrame with 8 rows and 3 columns
## SampleName cell dex
## <factor> <factor> <factor>
## SRR1039508 GSM1275862 N61311 untrt
## SRR1039509 GSM1275863 N61311 trt
## SRR1039512 GSM1275866 N052611 untrt
## SRR1039513 GSM1275867 N052611 trt
## SRR1039516 GSM1275870 N080611 untrt
## SRR1039517 GSM1275871 N080611 trt
## SRR1039520 GSM1275874 N061011 untrt
## SRR1039521 GSM1275875 N061011 trtExploratory analysis, e.g., clustering to ask whether there are distinct groups, outliers, etc. Multidimensional scaling using euclidean distance between asinh-transformed counts
## transpose assay data (t), asinh transform, euclidean distance
## between samples (dist). Use classical multidimensional scaling for
## dimension reduction
cmd <- cmdscale(dist(asinh(t(assay(airway)))))
## solid circles (pch=20) at 4 times usual size (cex=4); col points based
## on dex treatment (col=airway$dex)
plot(cmd, col=airway$dex, pch=20, cex=4)
Assess each gene for differential expression. Advanced statistical model. Efficient implementation in C.
library(DESeq2)
airway <- airway[rowSums(assay(airway)) > 0,]
## specifiy experimental design
dds <- DESeqDataSet(airway, ~ cell + dex)
## 'normalize' columns for techincal artifacts; fit negative
## binomial model to each gene
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## summarize results
res <- results(dds)
res[head(order(res$padj)),]
## log2 fold change (MAP): dex untrt vs trt
## Wald test p-value: dex untrt vs trt
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000152583 997.4398 -4.316100 0.1724125 -25.03357
## ENSG00000165995 495.0929 -3.188698 0.1277441 -24.96160
## ENSG00000101347 12703.3871 -3.618232 0.1499441 -24.13054
## ENSG00000120129 3409.0294 -2.871326 0.1190334 -24.12202
## ENSG00000189221 2341.7673 -3.230629 0.1373643 -23.51870
## ENSG00000211445 12285.6151 -3.552999 0.1589971 -22.34631
## pvalue padj
## <numeric> <numeric>
## ENSG00000152583 2.636198e-138 4.888829e-134
## ENSG00000165995 1.598068e-137 1.481809e-133
## ENSG00000101347 1.195444e-128 6.808712e-125
## ENSG00000120129 1.468582e-128 6.808712e-125
## ENSG00000189221 2.625986e-122 9.739783e-119
## ENSG00000211445 1.311375e-110 4.053243e-107
plot(log2FoldChange ~ asinh(baseMean), res, pch=".",
cex=ifelse(res$padj < .001, 3, 1))
Summarize results
Making complicated configurations accessible
This position offers a challenging and creative opportunity for a talented and independent Web / Systems Administrator. The successful applicant will participate in many end-user-facing activities of the successful open-source Bioconductor project for the analysis and comprehension of high-throughput genomic data. Initial duties involve management of cloud-based computing resources, including our web site https://bioconductor.org and support facilities https://support.bioconductor.org. Responsibilities will grow to include day-to-day oversight of our software build system, as well as trouble-shooting and management of user-contributed packages. There are considerable opportunities for developing innovative modern, containerized (e.g., Docker, AMI), and cloud-based (commercial and in-house) solutions to enable use of our software, and to employ modern automation software to deploy and manage in-house computational resources. This person will work in our small team of on- and off-site team members. The size of our team means that the successful applicant will become an expert in these areas, relying on close collaboration with other team members for support. The successful applicant will of necessity become familiar with the R programming language and Bioconductor ecosystem, and should be comfortable with the challenges and opportunities that implies.
Interested? Contact martin.morgan@roswellpark.org
Bioconductor Core Team (Current and Recent)
Technical and Scientific Advisory Boards
Funding